Re: Is it worth accepting multiple CRLs?

2021-01-14 Thread Peter Eisentraut

On 2020-08-31 11:03, Kyotaro Horiguchi wrote:

At Tue, 18 Aug 2020 16:43:47 +0900 (JST), Kyotaro Horiguchi 
 wrote in

Thank you very much. I'll do that after some polishing.

A nearby discussion about OpenSSL 3.0 conflicts with this, but it's
easy to follow.


Rebased. Fixed bogus tests and a strange tentative API change in
SSLServer.pm.  Corrected a (possible) spelling mistake.  I'm going to
register this in the coming CF.


Other systems that offer both a CRL file and a CRL directory usually 
specify those using two separate configuration settings.  Examples:


https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_ssl_crlpath
https://httpd.apache.org/docs/current/mod/mod_ssl.html#sslcarevocationpath

These are then presumably both passed to X509_STORE_load_locations(), 
which supports specifying a file and directory concurrently.
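
For reference, a rough sketch of how both locations could be handed to
OpenSSL (my illustration, not code from the patch, assuming settings along
the lines of the existing ssl_crl_file plus a hypothetical ssl_crl_dir;
either argument of X509_STORE_load_locations() may be NULL):

#include <openssl/ssl.h>
#include <openssl/x509_vfy.h>

static int
load_crl_locations(SSL_CTX *ctx, const char *crl_file, const char *crl_dir)
{
    X509_STORE *store = SSL_CTX_get_cert_store(ctx);

    /* Load CRLs from a file and/or a hashed directory in a single call. */
    if (X509_STORE_load_locations(store,
                                  (crl_file && *crl_file) ? crl_file : NULL,
                                  (crl_dir && *crl_dir) ? crl_dir : NULL) != 1)
        return 0;               /* caller reports the OpenSSL error */

    /* Have CRLs checked for the whole certificate chain. */
    X509_STORE_set_flags(store,
                         X509_V_FLAG_CRL_CHECK | X509_V_FLAG_CRL_CHECK_ALL);
    return 1;
}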


I think that would be a preferable approach.  In practical terms, it 
would allow a user to introduce the directory method gradually without 
having to convert the existing CRL file at the same time.





Re: Asynchronous Append on postgres_fdw nodes.

2021-01-14 Thread Kyotaro Horiguchi
At Sat, 19 Dec 2020 17:55:22 +0900, Etsuro Fujita  
wrote in 
> On Mon, Dec 14, 2020 at 4:01 PM Kyotaro Horiguchi
>  wrote:
> > At Sat, 12 Dec 2020 18:25:57 +0900, Etsuro Fujita  
> > wrote in
> > > On Fri, Nov 20, 2020 at 3:51 PM Kyotaro Horiguchi
> > >  wrote:
> > > > The reason for
> > > > the early fetching is letting fdw send the next request as early as
> > > > possible. (However, I didn't measure the effect of the
> > > > nodeAppend-level prefetching.)
> > >
> > > I agree that that would lead to improved efficiency in some cases,
> > > but I still think that that would be useless in some other cases like
> > > SELECT * FROM sharded_table LIMIT 1.  Also, I think the situation
> > > would get worse if we support Append on top of joins or aggregates
> > > over ForeignScans, which would be more expensive to perform than these
> > > ForeignScans.
> >
> > I'm not sure which gain we weigh more heavily, but if doing "LIMIT 1" on Append
> > many times is more common than fetching all rows or "LIMIT <multiples of
> > fetch_size>", that discussion would be convincing... Is it really the case?
> 
> I don't have a clear answer for that...  Performance in the case you
> mentioned would be improved by async execution without prefetching by
> Append, so it seemed reasonable to me to remove that prefetching to
> avoid unnecessary overheads in the case I mentioned.  BUT: I started
> to think my proposal, which needs an additional FDW callback routine
> (ie, ForeignAsyncBegin()), might be a bad idea, because it would
> increase the burden on FDW authors.

I agree on the point of developers' burden.

> > > If we do prefetching, I think it would be better that it’s the
> > > responsibility of the FDW to do prefetching, and I think that that
> > > could be done by letting the FDW to start another data fetch,
> > > independently of the core, in the ForeignAsyncNotify callback routine,
> >
> > The FDW does prefetching (if that means sending a request to the remote) in my
> > patch, so I agree with that.  I suspect that you intended to say
> > the opposite.  The core (ExecAppendAsyncGetNext()) controls
> > prefetching in your patch.
> 
> No.  That function just tries to retrieve a tuple from any of the
> ready subplans (ie, subplans marked as as_needrequest).

Mmm. I meant that the function explicitly calls
ExecAppendAsyncRequest(), which finally calls fetch_more_data_begin()
(if needed). Conversely, if the function doesn't call
ExecAppendAsyncRequest(), the next request to the remote doesn't
happen. That is, after the FDW-side tuple buffer is exhausted, the
next request doesn't happen until the executor requests the next
tuple. You seem to be saying that "postgresForeignAsyncRequest() calls
fetch_more_data_begin() following its own decision", but this doesn't
seem to be "prefetching".

> > > which I revived from Robert's original patch.  I think that that would
> > > be more efficient, because the FDW would no longer need to wait until
> > > all buffered tuples are returned to the core.  In the WIP patch, I
> >
> > I don't understand. My patch sends a prefetch query as soon as all the
> > tuples of the last remote request are stored into FDW storage.  The
> > reason for removing ExecAsyncNotify() was that it is just redundant as far
> > as Append asynchrony is concerned. But I particularly oppose reviving
> > the function.
> 
> Sorry, my explanation was not good, but what I'm saying here is about
> my patch, not your patch.  I think this FDW callback routine would be
> useful; it allows an FDW to perform another asynchronous data fetch
> before delivering a tuple to the core as discussed in [1].  Also, it
> would be useful when extending to the case where we have intermediate
> nodes between an Append and a ForeignScan such as joins or aggregates,
> which I'll explain below.

Yeah. If a non-immediate parent of an async-capable node works as
async-aware, the notify API would be useful. So I don't object to
the API.

> > > only allowed the callback routine to put the corresponding ForeignScan
> > > node into a state where it’s either ready for a new request or needing
> > > a callback for another data fetch, but I think we could probably relax
> > > the restriction so that the ForeignScan node can be put into another
> > > state where it’s ready for a new request while needing a callback for
> > > the prefetch.
> >
> > I don't understand this, either. ExecAsyncNotify() doesn't touch any of
> > the bitmaps (as_needrequest, callback_pending, or as_asyncpending) in
> > your patch.  Am I looking at something wrong?  I'm looking at
> > async-wip-2020-11-17.patch.
> 
> In the WIP patch I post, these bitmaps are modified in the core side
> based on the callback_pending and request_complete flags in
> AsyncRequests returned from FDWs (See ExecAppendAsyncEventWait()).

Sorry. I think I misread you here. I agree that the notify API is not
so useful now, but it would be useful if we allow notifying descendants other
than immediate children. However, I stumbled on 

Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-14 Thread japin


On Fri, 15 Jan 2021 at 14:50, Bharath Rupireddy 
 wrote:
> On Fri, Jan 15, 2021 at 11:33 AM Hou, Zhijie  
> wrote:
>>
>> > On Thu, Jan 14, 2021 at 5:36 PM Li Japin  wrote
>> > > Do we really need to access PUBLICATIONRELMAP in this patch? What if
>> > > we just set it to false in the else condition of (if (publish &&
>> > > (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot)))
>> > >
>> > > Thank you for your review. I agree with you, it doesn't need to access
>> > > PUBLICATIONRELMAP, since we already get the publication oid in
>> > > GetRelationPublications(relid) [1], which also accesses
>> > > PUBLICATIONRELMAP.  If the current pub->oid is not in pubids, then
>> > publish is false, so we do not need to publish the table.
>> >
>> > +1. This is enough.
>> >
>> > > I have another question: data->publications is a list; when does it
>> > have more than one item?
>> >
>> > IIUC, when the single table is associated with multiple publications, then
>> > data->publications will have multiple entries. Though I have not tried,
>> > we can try having two or three publications for the same table and verify
>> > that.
>> >
>> > > 0001 - change as you suggest.
>> > > 0002 - does not change.
>>
>>
>> Hi
>>
>> In the 0001 patch
>>
>> The comment says "Relation is not associated with the publication anymore".
>> That comment was accurate when the last patch checked
>> is_relation_part_of_publication().
>>
>> But with the code put in the else branch, not only the case in the
>> comment (relation is dropped) but also other cases, such as partitioned
>> tables, will hit the else branch too.
>>
>> Do you think we need to fix the comment a little to make it accurate?
>
> Thanks, that comment needs a rework, and in fact so does the else condition
> (because we may fall into the else branch even when publish is true
> and (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot) is
> false, but our intention, to fix the bug reported here, is to take the
> else branch only when publish is false). Also, instead of just
> relying on the publish variable, which can get set even when the
> relation is not in the publication but ancestor_published is true, we
> can just have something like below to fix the bug reported here:
>
> if (publish &&
> (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot))
> {
> entry->pubactions.pubinsert |= pub->pubactions.pubinsert;
> entry->pubactions.pubupdate |= pub->pubactions.pubupdate;
> entry->pubactions.pubdelete |= pub->pubactions.pubdelete;
> entry->pubactions.pubtruncate |= pub->pubactions.pubtruncate;
> }
> else if (!list_member_oid(pubids, pub->oid))
> {
> /*
>  * Relation is not associated with the publication anymore i.e
>  * it would have been dropped from the publication. So we 
> don't
>  * need to publish the changes for it.
>  */
> entry->pubactions.pubinsert = false;
> entry->pubactions.pubupdate = false;
> entry->pubactions.pubdelete = false;
> entry->pubactions.pubtruncate = false;
> }
>
> This way, the fix focuses only on what was reported here, it hopefully
> doesn't cause any regressions, and the existing comment becomes
> appropriate.
>
> Thoughts?
>

I'm sorry, the 0001 patch is totally wrong.  If we have only one publication,
it works; however, this is a special case.  If we have multiple publications in
one subscription, it doesn't work.  Here is a use case.

Step (1)
-- On publisher
CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int);
CREATE PUBLICATION mypub1 FOR TABLE t1 WITH (publish='insert');
CREATE PUBLICATION mypub2 FOR TABLE t2;

Step (2)
-- On subscriber
CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int);
CREATE SUBSCRIPTION mysub CONNECTION 'host=localhost port=5432 dbname=postgres' 
PUBLICATION mypub1, mypub2;

Step (3)
-- On publisher
INSERT INTO t1 VALUES (1);

Step (4)
-- On subscriber
SELECT * FROM t1;

In step (4), we cannot get the records, because there are two publications in
data->publications, and we check them one by one.
The first publication is "mypub1", so entry->pubactions.pubinsert will be set to
true; however, in the second round the publication is "mypub2", and we cannot
find pub->oid in pubids (which contains only the oid of "mypub1"), so all
entry->pubactions are set to false, and nothing will be replicated to the
subscriber.

My apologies!

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Wrong HINT during database recovery when occur a minimal wal.

2021-01-14 Thread lchch1...@sina.cn

Sorry, I don't know why it showed up in the wrong format; I'll try to correct it here.
-

When I do PITR in a strange step, I get this FATAL:

2021-01-15 15:02:52.364 CST [14958] FATAL:  hot standby is not possible because 
wal_level was not set to "replica" or higher on the primary server
2021-01-15 15:02:52.364 CST [14958] HINT:  Either set wal_level to "replica" on 
the primary, or turn off hot_standby here.

The strange step is that I changed wal_level to minimal after the base backup.

My question is: what is the meaning of [set wal_level to "replica" on the
primary] in the HINT?  I can't think of a case where this FATAL can be solved by
setting wal_level; I can only solve it by turning off hot_standby.

Do you think we can do this code change?
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6300,7 +6300,7 @@ CheckRequiredParameterValues(void)
  if (ControlFile->wal_level < WAL_LEVEL_REPLICA)
  ereport(ERROR,
  (errmsg("hot standby is not possible because wal_level was not set to 
\"replica\" or higher on the primary server"),
-  errhint("Either set wal_level to \"replica\" on the primary, or turn off 
hot_standby here.")));
+  errhint("You should turn off hot_standby here.")));



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: WIP: document the hook system

2021-01-14 Thread Peter Eisentraut

On 2020-12-31 04:28, David Fetter wrote:

This could probably use a lot of filling in, but having it in the
actual documentation beats needing to know folklore even to know
that the capability is there.


This patch seems quite short of a state where one could begin to 
evaluate it.  Documenting the hooks better seems a worthwhile goal.   I 
think the question is whether we can develop documentation that is 
genuinely useful by itself without studying the relevant source code. 
This submission does not address that question.





Wrong HINT during database recovery when occur a minimal wal.

2021-01-14 Thread lchch1990
Hello hackers,

When I do PITR in a strange step, I get this FATAL:

2021-01-15 15:02:52.364 CST [14958] FATAL:  hot standby is not possible because
wal_level was not set to "replica" or higher on the primary server
2021-01-15 15:02:52.364 CST [14958] HINT:  Either set wal_level to "replica" on
the primary, or turn off hot_standby here.

The strange step is that I changed wal_level to minimal after the base backup.

My question is: what is the meaning of [set wal_level to "replica" on the
primary] in the HINT?  I can't think of a case where this FATAL can be solved by
setting wal_level; I can only solve it by turning off hot_standby.

Do you think we can do this code change?

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6300,7 +6300,7 @@ CheckRequiredParameterValues(void)
  if (ControlFile->wal_level < WAL_LEVEL_REPLICA)
  ereport(ERROR,
  (errmsg("hot standby is not possible because wal_level was not set to 
\"replica\" or higher on the primary server"),
-  errhint("Either set wal_level to \"replica\" on the primary, or turn off 
hot_standby here.")));
+  errhint("You should turn off hot_standby here.")));

---
Regards,
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: [patch] [doc] Further note required activity aspect of automatic checkpoint and archving

2021-01-14 Thread Peter Eisentraut

On 2020-10-12 23:54, David G. Johnston wrote:

--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -722,6 +722,8 @@ test ! -f 
/mnt/server/archivedir/000100A90065  cp pg_wal/0
      short archive_timeout  it will bloat 
your archive
      storage.  archive_timeout settings of a minute 
or so are

      usually reasonable.
+    This is mitigated by the fact that empty WAL segments will not be 
archived

+    even if the archive_timeout period has elapsed.
     


This is hopefully not what happens.  What this would mean is that I'd 
then have a sequence of WAL files named, say,


1, 2, 3, 7, 8, ...

because a few in the middle were not archived because they were empty.


--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3131,6 +3131,8 @@ include_dir 'conf.d'
        
         
          Maximum time between automatic WAL checkpoints.
+        The automatic checkpoint will do nothing if no new WAL has been
+        written since the last recorded checkpoint.
          If this value is specified without units, it is taken as seconds.
          The valid range is between 30 seconds and one day.
          The default is five minutes (5min).


I think what happens is that the checkpoint is skipped, not that the 
checkpoint happens but does nothing.  That is the wording you cited in 
the other thread from 
.





Re: Add Nullif case for eval_const_expressions_mutator

2021-01-14 Thread Peter Eisentraut

On 2021-01-12 07:43, Hou, Zhijie wrote:

I think this patch should be about a tenth the size.  Try modeling it
on the T_SubscriptingRef-etc case, ie, use ece_generic_processing and
then ece_evaluate_expr to cover the generic cases.  OpExpr is common
enough to deserve specially optimized code, but NullIf isn't, so shorter

is better.

Thanks for the review.

Attaching v2 patch , which followed the suggestion to use
ece_generic_processing and ece_evaluate_expr to simplify the code.

Please have a check.


Sorry, I found the code could still be simplified further.
Attaching the new patch.


It's a bit unfortunate that between OpExpr, DistinctExpr,
NullIfExpr, and to a lesser extent ScalarArrayOpExpr we will now have
several different implementations of nearly the same thing, without any
explanation of why one approach was chosen here and another there.  We
should at least document this.


Some inconsistencies I found: The code for DistinctExpr calls  
expression_tree_mutator() directly, but your code for NullIfExpr calls  
ece_generic_processing(), even though the explanation in the comment for  
DistinctExpr would apply there as well.


Your code for NullIfExpr doesn't appear to call set_opfuncid() anywhere.

I would move your new block for NullIfExpr after the block for  
DistinctExpr.  That's the order in which these blocks appear elsewhere  
for generic node processing.
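
For concreteness, a rough sketch (mine, not the submitted patch) of what
that could look like, modeled on the existing ScalarArrayOpExpr handling in
clauses.c, whose ece_* macros and set_opfuncid() are assumed here:

case T_NullIfExpr:
    {
        NullIfExpr *expr;

        /* Copy the node and const-simplify its arguments */
        expr = (NullIfExpr *) ece_generic_processing(node);

        /* Must fill in the comparison function OID before evaluating */
        set_opfuncid((OpExpr *) expr);

        /*
         * If all arguments are Consts and the function is safe to
         * evaluate, fold the whole expression to a constant.
         */
        if (ece_all_arguments_const(expr) &&
            ece_function_is_safe(expr->opfuncid, context))
            return ece_evaluate_expr(expr);

        return (Node *) expr;
    }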


Check your whitespace usage:

if(!has_nonconst_input)

should have a space after the "if".  (It's easy to fix of course, but I  
figure I'd point it out here since you have submitted several patches  
with this style, so it's perhaps a habit to break.)


Perhaps add a comment to the tests like this so it's clear what they are  
for:


diff --git a/src/test/regress/sql/case.sql b/src/test/regress/sql/case.sql
index 4742e1d0e0..98e3fb8de5 100644
--- a/src/test/regress/sql/case.sql
+++ b/src/test/regress/sql/case.sql
@@ -137,6 +137,7 @@ CREATE TABLE CASE2_TBL (
   FROM CASE_TBL a, CASE2_TBL b
   WHERE COALESCE(f,b.i) = 2;

+-- Tests for constant subexpression simplification
 explain (costs off)
 SELECT * FROM CASE_TBL WHERE NULLIF(1, 2) = 2;




fdatasync(2) on macOS

2021-01-14 Thread Thomas Munro
Hello hackers,

While following along with the nearby investigation into weird
cross-version Apple toolchain issues that confuse configure, I noticed
that the newer buildfarm Macs say:

checking for fdatasync... (cached) yes

That's a bit strange because there's no man page and no declaration:

checking whether fdatasync is declared... (cached) no

That's no obstacle for us, because c.h does:

#if defined(HAVE_FDATASYNC) && !HAVE_DECL_FDATASYNC
extern int  fdatasync(int fildes);
#endif

So... does this unreleased function flush drive caches?  We know that
fsync(2) doesn't, based on Apple's advice[1] for database hackers to
call fcntl(fd, F_FULLFSYNC, 0) instead.  We do that.
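
For anyone reading along, a minimal sketch of that idiom (illustrative only,
not the in-tree wrapper):

#include <fcntl.h>
#include <unistd.h>

static int
flush_to_stable_storage(int fd)
{
#if defined(F_FULLFSYNC)
    /* Ask the OS to push the data all the way to stable storage. */
    if (fcntl(fd, F_FULLFSYNC, 0) == 0)
        return 0;
    /* F_FULLFSYNC can fail on some file systems; fall back to fsync(). */
#endif
    return fsync(fd);
}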

Speaking as an armchair Internet Unix detective, my guess is: no.  In
the source[2] we can see that there is a real system call table entry
and VFS support, so there is *something* wired up to this lever.  On
the other hand, it shares code with fsync(2), and I suppose that
fdatasync(2) isn't going to do *more* than fsync(2).  But who knows?
Not only is it unreleased, but below VNOP_FSYNC() you reach closed
source file system code.

That was fun, but now I'm asking myself: do we really want to use an
IO synchronisation facility that's not declared by the vendor?  I see
that our declaration goes back 20 years to 33cc5d8a, which introduced
fdatasync(2).  The discussion from the time[3] makes it clear that the
OS support was very patchy and thin back then.

Just by the way, another fun thing I learned about libSystem while
reading up on Big Sur changes is that the system libraries are no
longer on the file system.  dlopen() is magical.

[1] 
https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/fsync.2.html
[2] 
https://github.com/apple/darwin-xnu/blob/d4061fb0260b3ed486147341b72468f836ed6c8f/bsd/vfs/vfs_syscalls.c#L7708
[3] 
https://www.postgresql.org/message-id/flat/200102171805.NAA24180%40candle.pha.pa.us




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-14 Thread Bharath Rupireddy
On Fri, Jan 15, 2021 at 11:33 AM Hou, Zhijie  wrote:
>
> > On Thu, Jan 14, 2021 at 5:36 PM Li Japin  wrote
> > > Do we really need to access PUBLICATIONRELMAP in this patch? What if
> > > we just set it to false in the else condition of (if (publish &&
> > > (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot)))
> > >
> > > Thank you for your review. I agree with you, it doesn't need to access
> > > PUBLICATIONRELMAP, since we already get the publication oid in
> > > GetRelationPublications(relid) [1], which also accesses
> > > PUBLICATIONRELMAP.  If the current pub->oid is not in pubids, then
> > publish is false, so we do not need to publish the table.
> >
> > +1. This is enough.
> >
> > > I have another question: data->publications is a list; when does it
> > have more than one item?
> >
> > IIUC, when the single table is associated with multiple publications, then
> > data->publications will have multiple entries. Though I have not tried,
> > we can try having two or three publications for the same table and verify
> > that.
> >
> > > 0001 - change as you suggest.
> > > 0002 - does not change.
>
>
> Hi
>
> In the 0001 patch
>
> The comment says "Relation is not associated with the publication anymore".
> That comment was accurate when the last patch checked
> is_relation_part_of_publication().
>
> But with the code put in the else branch, not only the case in the
> comment (relation is dropped) but also other cases, such as partitioned
> tables, will hit the else branch too.
>
> Do you think we need to fix the comment a little to make it accurate?

Thanks, that comment needs a rework, and in fact so does the else condition
(because we may fall into the else branch even when publish is true
and (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot) is
false, but our intention, to fix the bug reported here, is to take the
else branch only when publish is false). Also, instead of just
relying on the publish variable, which can get set even when the
relation is not in the publication but ancestor_published is true, we
can just have something like below to fix the bug reported here:

if (publish &&
(relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot))
{
entry->pubactions.pubinsert |= pub->pubactions.pubinsert;
entry->pubactions.pubupdate |= pub->pubactions.pubupdate;
entry->pubactions.pubdelete |= pub->pubactions.pubdelete;
entry->pubactions.pubtruncate |= pub->pubactions.pubtruncate;
}
else if (!list_member_oid(pubids, pub->oid))
{
/*
 * Relation is not associated with the publication anymore i.e
 * it would have been dropped from the publication. So we don't
 * need to publish the changes for it.
 */
entry->pubactions.pubinsert = false;
entry->pubactions.pubupdate = false;
entry->pubactions.pubdelete = false;
entry->pubactions.pubtruncate = false;
}

This way, the fix focuses only on what was reported here, it hopefully
doesn't cause any regressions, and the existing comment becomes
appropriate.

Thoughts?

> And it seems we can add a "continue;" in the else branch.
> Of course this is not necessary.

As you said, it's not necessary because the following if condition
will fail anyway. Having said that, if we add continue, then any
future code that gets added after the following if condition will
never get executed. Though the reasoning may sound silly, IMO let's
not add the continue in the else branch.

if (entry->pubactions.pubinsert && entry->pubactions.pubupdate &&
entry->pubactions.pubdelete && entry->pubactions.pubtruncate)
break;

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: adding wait_start column to pg_locks

2021-01-14 Thread torikoshia

Thanks for your reviewing and comments!

On 2021-01-14 12:39, Ian Lawrence Barwick wrote:
Looking at the code, this happens as the wait start time is being
recorded in the lock record itself, so it always contains the value
reported by the latest lock acquisition attempt.


I think you are right and wait_start should not be recorded
in the LOCK.


On 2021-01-15 11:48, Ian Lawrence Barwick wrote:

2021年1月15日(金) 3:45 Robert Haas :


On Wed, Jan 13, 2021 at 10:40 PM Ian Lawrence Barwick
 wrote:

It looks like the logical place to store the value is in the

PROCLOCK

structure; ...


That seems surprising, because there's one PROCLOCK for every
combination of a process and a lock. But, a process can't be waiting
for more than one lock at the same time, because once it starts
waiting to acquire the first one, it can't do anything else, and
thus
can't begin waiting for a second one. So I would have thought that
this would be recorded in the PROC.


Umm, I think we're at cross-purposes here. The suggestion is to note
the time when the process started waiting for the lock in the
process's
PROCLOCK, rather than in the lock itself (which in the original
version
of the patch resulted in all processes with an interest in the lock
appearing
to have been waiting to acquire it since the time a lock acquisition
was most recently attempted).


AFAIU, it seems possible to record wait_start in the PROCLOCK, but that is
redundant since each process can wait for at most one lock.

To confirm my understanding, I'm going to make another patch that
records wait_start in the PGPROC.
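
To make that concrete, a rough sketch of the direction (my assumption, not
the actual patch):

#include <stdint.h>

/* TimestampTz is PostgreSQL's microsecond-precision int64 timestamp. */
typedef int64_t TimestampTz;

typedef struct PGPROC
{
    /* ... existing fields elided ... */

    TimestampTz waitStart;      /* time at which this backend started
                                 * waiting for a lock, or 0 if it is
                                 * not waiting */
} PGPROC;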


Regards,

--
Atsushi Torikoshi




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-14 Thread Bharath Rupireddy
On Thu, Jan 14, 2021 at 10:53 AM Amit Kapila  wrote:
> > 0002 - Invalidates the relation map cache in subscriber syscache
> > invalidation callbacks. Currently, I'm setting entry->state to
> > SUBREL_STATE_UNKNOWN in the newly introduced invalidation function
> > logicalrep_relmap_invalidate, so that in the next call to
> > logicalrep_rel_open the entry's state will be read from the system
> > catalogues using GetSubscriptionRelState. If this sounds a bit
> > strange, I can add a boolean invalid to the LogicalRepRelMapEntry
> > structure and set that here, and in logicalrep_rel_open I can have
> > something like:
> > if (entry->state != SUBREL_STATE_READY || entry->invalid)
> > entry->state = GetSubscriptionRelState(MySubscription->oid,
> >entry->localreloid,
> >&entry->statelsn);
> >
> >if (entry->invalid)
> > entry->invalid = false;
> >
>
> So, the point of the second patch is that it is good to have such a
> thing, otherwise, we don't see any problem if we just use patch-0001,
> right? I think if we fix the first-one, automatically, we will achieve
> what we are trying to with the second-one because ultimately
> logicalrep_relmap_update will be called due to Alter Pub... Drop table
> .. step and it will mark the entry as SUBREL_STATE_UNKNOWN.

On some further analysis, I found that logicalrep_relmap_update is
called in subscribers for insert, delete, update and truncate
statements on the relations dropped (from the publication) in the
publisher.

If there are any alters to pg_subscription, then we invalidate the subscription
in subscription_change_cb, and maybe_reread_subscription will
take care of re-reading from the system catalogues; see
apply_handle_->ensure_transaction -> maybe_reread_subscription.
And we don't have any entry in LogicalRepRelMapEntry that gets
affected because of changes to pg_subscription, so we don't need to
invalidate the rel map cache entries in subscription_change_cb.

If there are any alters to pg_subscription_rel, then there are two parameters in
the LogicalRepRelMapEntry structure, state and statelsn, that may get
affected. Changes to the statelsn column are taken care of by the
invalidation callback invalidate_syncing_table_states setting
table_states_valid and by
process_syncing_tables->process_syncing_tables_for_apply. But if the
state column is changed somehow (although I'm not quite sure how it
can change to the different states 'i', 'd', 's', 'r' after the initial
table sync phase in logical replication, we can pretty much do
something like update pg_subscription_rel set srsubstate = 'd';), in
this case invalidate_syncing_table_states gets called, but if we don't
revalidate the relation map cache entries, they will have the old
state value.

So what I feel is that we at least need the 0002 patch, but with the entries
invalidated only in invalidate_syncing_table_states, not in
subscription_change_cb, although I'm not able to back it up with any
use case or bug as such.

Thoughts?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




RE: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-14 Thread Hou, Zhijie
> On Thu, Jan 14, 2021 at 5:36 PM Li Japin  wrote
> > Do we really need to access PUBLICATIONRELMAP in this patch? What if
> > we just set it to false in the else condition of (if (publish &&
> > (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot)))
> >
> > Thank you for your review. I agree with you, it doesn't need to access
> > PUBLICATIONRELMAP, since we already get the publication oid in
> > GetRelationPublications(relid) [1], which also accesses
> > PUBLICATIONRELMAP.  If the current pub->oid is not in pubids, then
> publish is false, so we do not need to publish the table.
> 
> +1. This is enough.
> 
> > I have another question: data->publications is a list; when does it
> have more than one item?
> 
> IIUC, when the single table is associated with multiple publications, then
> data->publications will have multiple entries. Though I have not tried,
> we can try having two or three publications for the same table and verify
> that.
> 
> > 0001 - change as you suggest.
> > 0002 - does not change.


Hi

In the 0001 patch

The comment says "Relation is not associated with the publication anymore".
That comment was accurate when the last patch checked
is_relation_part_of_publication().

But with the code put in the else branch, not only the case in the
comment (relation is dropped) but also other cases, such as partitioned
tables, will hit the else branch too.

Do you think we need to fix the comment a little to make it accurate?


And it seems we can add a "continue;" in the else branch.
Of course this is not necessary.

Best regards,
houzj





Re: Yet another fast GiST build

2021-01-14 Thread Peter Eisentraut

On 2021-01-12 14:49, Heikki Linnakangas wrote:

I suggest calling BuildIndexValueDescription() from your own custom
debug instrumentation code.

Thanks for the hint, Peter!
This function does exactly what I want to do. But I have no Relation inside 
gist_page_items(bytea) function... probably, I'll add gist_page_items(relname, 
blockno) overload to fetch keys.


PFA patch with implementation.


I did a bit of cleanup on the function signature. The .sql script
claimed that gist_page_items() took bytea as argument, but in reality it
was a relation name, as text. I changed it so that it takes a page image
as argument, instead of reading the block straight from the index.
Mainly to make it consistent with brin_page_items(), if it wasn't for
that precedent I might've gone either way on it.


I noticed this patch while working on another patch for pageinspect [0], 
and this one appears to introduce a problem similar to the one the other 
patch attempts to fix: The "itemlen" output parameters are declared to 
be of type smallint, but the underlying C data is of type uint16 
(OffsetNumber).  I don't know the details of gist enough to determine 
whether overflow is possible here.  If not, perhaps a check or at least 
a comment would be useful.  Otherwise, these parameters should be of 
type int in SQL.


[0]: 
https://www.postgresql.org/message-id/09e2dd82-4eb6-bba1-271a-d2b58bf6c...@enterprisedb.com





Re: Transactions involving multiple postgres foreign servers, take 2

2021-01-14 Thread Masahiko Sawada
On Fri, Jan 15, 2021 at 4:03 AM Zhihong Yu  wrote:
>
> Hi,
> For v32-0008-Prepare-foreign-transactions-at-commit-time.patch :

Thank you for reviewing the patch!

>
> +   bool        have_notwophase = false;
>
> Maybe name the variable have_no_twophase so that it is easier to read.

Fixed.

>
> +* Two-phase commit is not required if the number of servers performed
>
> performed -> performing

Fixed.

>
> +errmsg("cannot process a distributed transaction that has 
> operated on a foreign server that does not support two-phase commit 
> protocol"),
> +errdetail("foreign_twophase_commit is \'required\' but the 
> transaction has some foreign servers which are not capable of two-phase 
> commit")));
>
> The lines are really long. Please wrap into more lines.

Hmm, we can do that, but it makes grepping by the error
message hard. Please refer to the documentation about the formatting
guideline[1]:

Limit line lengths so that the code is readable in an 80-column
window. (This doesn't mean that you must never go past 80 columns. For
instance, breaking a long error message string in arbitrary places
just to keep the code within 80 columns is probably not a net gain in
readability.)
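
As an illustration of that guideline, keeping the message quoted above on a
single line (the errcode shown is just a placeholder of mine) would look
like:

ereport(ERROR,
        (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
         errmsg("cannot process a distributed transaction that has operated on a foreign server that does not support two-phase commit protocol")));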

These changes have been made in the local branch. I'll post the
updated patch set after incorporating all the comments.

Regards,

[1] https://www.postgresql.org/docs/devel/source-format.html

-- 
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




fix typo in reorderbuffer.c

2021-01-14 Thread Hou, Zhijie
Hi

I found a possible typo in reorderbuffer.c

 *   has got a combocid. Combocid's are only valid for the duration of a

Combocid's ==>> Combocids

Attaching a small patch to correct it.

Best regards,
houzj




0001-fix-typo.patch
Description: 0001-fix-typo.patch


Pg14, pg_dumpall and "password_encryption=true"

2021-01-14 Thread Ian Lawrence Barwick
Greetings

Consider the following:

postgres=# SELECT current_setting('server_version');
 current_setting
-----------------
 12.5
(1 row)

postgres=# SELECT * FROM pg_db_role_setting WHERE setrole = 10;
 setdatabase | setrole |         setconfig
-------------+---------+----------------------------
           0 |      10 | {password_encryption=true}
(1 row)

$ $PG_14/pg_dumpall --version
pg_dumpall (PostgreSQL) 14devel

$ $PG_14/pg_dumpall -d 'port=5414' --globals-only | grep
"password_encryption"
ALTER ROLE postgres SET password_encryption TO 'true';

This command will fail in Pg14, e.g. when running pg_upgrade (which is where
I bumped into the issue):

$ pg_upgrade --whatever
...
Restoring global objects in the new cluster
*failure*

Consult the last few lines of "pg_upgrade_utility.log" for
the probable cause of the failure.
Failure, exiting

$ tail -3 pg_upgrade_utility.log
ALTER ROLE "postgres" SET "password_encryption" TO 'true';
psql:pg_upgrade_dump_globals.sql:75: ERROR:  invalid value for
parameter "password_encryption": "true"
HINT:  Available values: md5, scram-sha-256.

This is a consequence of commit c7eab0e97, which removed support for the
legacy values "on"/"true"/"yes"/"1".

I see following options to resolve this issue.

1. Re-add support for "true" as an alias for "md5".


Easiest option.

The disadvantage is that in Pg10 ~ Pg13, "password_encryption=true"
implicitly meant "use the default encryption method", which was "md5",
however the default in Pg14 is "scram-sha-256", so the meaning
of "true" changes to "set to the non-default method".


2. Transparently rewrite "true" to "md5"


AFAICS this isn't supported (we can remap parameter names, e.g.
sort_mem -> work_mem, but not enum values), so would require some
refactoring.

Adding a 4th field to "config_enum_entry" which, if not NULL, would cause
the setting to be stored as the specified value, e.g.:

  {"true", PASSWORD_TYPE_MD5, true, "md5"},

seems like a nice way of defining it, but poking about, it looks non-trivial
to actually implement, and a lot of code for a single workaround.
I suppose, if feasible, it might be useful in future, but I can't think
of anywhere else it might be useful.
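
For illustration only, here is the idea sketched above with hypothetical
names (the existing struct has just the three fields name, val and hidden):

struct config_enum_entry
{
    const char *name;
    int         val;
    bool        hidden;
    const char *store_as;       /* hypothetical: if not NULL, record this
                                 * value instead of the one supplied */
};

static const struct config_enum_entry password_encryption_options[] = {
    {"md5", PASSWORD_TYPE_MD5, false, NULL},
    {"scram-sha-256", PASSWORD_TYPE_SCRAM_SHA_256, false, NULL},
    {"true", PASSWORD_TYPE_MD5, true, "md5"},   /* legacy alias, remapped */
    {NULL, 0, false, NULL}
};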


3. Have pg_dumpall map "true" to "md5"

This would require adding an exception for "password_encryption" (currently
all settings are dumped as-is).

However this would result in dumps from pre-Pg10 versions which would not be
valid for restoring into those versions (assuming we want later versions of
pg_dump/pg_dumpall to be able to produce dumps from older versions which are
valid for those older versions).


4. Add an option to pg_dumpall to specify the target version


As (4), but only remap if the target version is 10 or later.

And have pg_upgrade pass the target version option to pg_dumpall.


5. Have pg_upgrade emit a warning/hint

This would catch what is likely a common occurrence of this issue.

However it's annoying for the user to have to deal with this during an
upgrade. And it still leaves Pg14's pg_dumpall at risk of creating dumps which
are not actually valid for Pg14, which will cause issues when doing upgrades
via logical replication etc.


6. Document this as a backwards-compatibility thing


Add an item in the documentation (release notes, pg_upgrade, pg_dumpall)
stating that any occurrences of "password_encryption" which are not valid in
Pg14 should be updated before performing an upgrade.

This has the disadvantage that it's just documentation and as such at risk of
being overlooked, time and time again.

And it still also leaves Pg14's pg_dumpall at risk of creating dumps which are
not actually valid for Pg14.

I'll add to the next CF so this doesn't get lost.

Regards

Ian Barwick


-- 
EnterpriseDB: https://www.enterprisedb.com


Re: vacuum_cost_page_miss default value and modern hardware

2021-01-14 Thread Masahiko Sawada
On Thu, Jan 14, 2021 at 9:24 AM Peter Geoghegan  wrote:
>
> On Wed, Jan 6, 2021 at 7:43 PM Peter Geoghegan  wrote:
> > More concretely, we could perhaps lower vacuum_cost_page_miss to 5. It
> > has had 10 as its default since 2004 (just like
> > vacuum_cost_page_dirty, whose default has also not been changed since
> > the start). These defaults were decided in a time when nbtree VACUUM
> > could do lots of random I/O, there was no visibility map, etc. So this
> > refresh is not just about hardware.
>
> Attached patch lowers vacuum_cost_page_miss to 3. I think that this
> change in the default is both likely to be helpful in medium to large
> installations, and unlikely to cause harm in small installations. If
> I/O for reads made by VACUUM is naturally very slow (even in the
> common case where it's entirely sequential), then that will naturally
> provide additional throttling.

+1 for this change. Lowering to 2 also looks good to me.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/




RE: invalid data in file backup_label problem on windows

2021-01-14 Thread Wang, Shenhao
Hi, Michael

>Please feel free to, under the section "Bug fixes".  This way, it
>won't get lost in the traffic of this list.
>--
>Michael

Thank you for your advice; I've added it.

Best Regards,
Shenhao Wang






Re: [bug fix] Fix the size calculation for shmem TOC

2021-01-14 Thread Fujii Masao




On 2021/01/14 18:15, Fujii Masao wrote:



On 2021/01/14 17:47, tsunakawa.ta...@fujitsu.com wrote:

Hello,


The attached patch fixes a trivial mistake in the calculation of shmem TOC.  
The original code allocates unduly large memory because it adds the result of 
add_size() to its left argument.
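
To illustrate the kind of mistake being described, with made-up variable
names: add_size() already returns the overflow-checked sum of its two
arguments, so also applying "+=" counts the left-hand value twice.

Size        total = 0;
Size        toc_entry_size = 64;            /* made-up value */

total += add_size(total, toc_entry_size);   /* wrong: total counted twice */
total = add_size(total, toc_entry_size);    /* right: overflow-checked sum */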

Thanks for the report and patch! The patch looks good to me.


Pushed and back-patched to v11. Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: invalid data in file backup_label problem on windows

2021-01-14 Thread Michael Paquier
On Fri, Jan 15, 2021 at 03:29:57AM +, Wang, Shenhao wrote:
> On Windows, if a backup_label file contains a Windows (CRLF) line ending,
> recovering from this backup will fail.
> 
> I think this is a problem; can I add this to the commitfest?

Please feel free to, under the section "Bug fixes".  This way, it
won't get lost in the traffic of this list.
--
Michael


signature.asc
Description: PGP signature


RE: invalid data in file backup_label problem on windows

2021-01-14 Thread Wang, Shenhao
Hi, again:

On Windows, if a backup_label file contains a Windows (CRLF) line ending,
recovering from this backup will fail.

I think this is a problem; can I add this to the commitfest?

Best Regards,
Shenhao Wang

-Original Message-
From: Wang, Shenhao  
Sent: Sunday, January 10, 2021 8:58 PM
To: pgsql-hackers@lists.postgresql.org
Subject: invalid data in file backup_label problem on windows

Hi hackers.

When I use a non-exclusive low-level backup on Windows, "invalid data in file
backup_label" will be shown.

The code is listed below:

if (fscanf(lfp, "START WAL LOCATION: %X/%X (file %08X%16s)%c",
           &hi, &lo, &tli_from_walseg, startxlogfilename, &ch) != 5 ||
    ch != '\n')
    ereport(FATAL,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));

When I wrote the backup_label on Windows, CRLF is at the end of the line, so
ch will be '\r'.

I'm not sure whether the manual[1] mentions that these two files (tablespace_map
and backup_label) should not use the CRLF newline style.
How about setting text mode when reading these two files, as in the patch
attached?

Any thought?

[1] 
https://www.postgresql.org/docs/13/continuous-archiving.html#BACKUP-LOWLEVEL-BASE-BACKUP


Best Regards,
Shenhao Wang










Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-14 Thread Amit Kapila
On Thu, Jan 14, 2021 at 5:36 PM Li Japin  wrote:
>
> On Jan 14, 2021, at 1:25 PM, Amit Kapila  wrote:
>
> Attaching following two patches:
>
> 0001 - Makes publisher to not publish the changes for the alter
> publication ... dropped tables. Original patch is by japin, I added
> comments, changed the function name and adjusted the code a bit.
>
>
> Do we really need to access PUBLICATIONRELMAP in this patch? What if
> we just set it to false in the else condition of (if (publish &&
> (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot)))
>
>
> Thank you for your review. I agree with you, it doesn't need to access
> PUBLICATIONRELMAP, since we already get the publication oid in
> GetRelationPublications(relid) [1], which also accesses PUBLICATIONRELMAP.
> If the current pub->oid is not in pubids, then publish is false, so we
> do not need to publish the table.
>
> I have another question: data->publications is a list; when does it have
> more than one item?
>
> 0001 - change as you suggest.
>

Can we add a test case for this patch as this seems to be a
reproducible bug? We can add the test in
src/test/subscription/100_bugs.pl unless you find a better place to
add it.

-- 
With Regards,
Amit Kapila.




Re: Remove PG_SHA*_DIGEST_STRING_LENGTH from sha2.h

2021-01-14 Thread Michael Paquier
On Thu, Jan 14, 2021 at 03:11:12PM +0900, Michael Paquier wrote:
> I have just noticed that aef8948 has removed the last reference to
> PG_SHA256_DIGEST_STRING_LENGTH in the code (this was getting used to
> know the length of a SHA256 digest encoded to hex for checksum
> manifests, but pg_hex_enc_len() calculates the same in a more
> consistent way).  Attached is a patch to remove all those now-useless
> declarations, for all four SHA2 options.

Cleanup done as of ccf4e27.
--
Michael


signature.asc
Description: PGP signature


Re: adding wait_start column to pg_locks

2021-01-14 Thread Ian Lawrence Barwick
2021年1月15日(金) 3:45 Robert Haas :

> On Wed, Jan 13, 2021 at 10:40 PM Ian Lawrence Barwick 
> wrote:
> > It looks like the logical place to store the value is in the PROCLOCK
> > structure; ...
>
> That seems surprising, because there's one PROCLOCK for every
> combination of a process and a lock. But, a process can't be waiting
> for more than one lock at the same time, because once it starts
> waiting to acquire the first one, it can't do anything else, and thus
> can't begin waiting for a second one. So I would have thought that
> this would be recorded in the PROC.
>

Umm, I think we're at cross-purposes here. The suggestion is to note
the time when the process started waiting for the lock in the process's
PROCLOCK, rather than in the lock itself (which in the original version
of the patch resulted in all processes with an interest in the lock
appearing
to have been waiting to acquire it since the time a lock acquisition
was most recently attempted).

As mentioned, I hadn't really ever looked at the lock code before so might
be barking up the wrong forest.


Regards

Ian Barwick

-- 
EnterpriseDB: https://www.enterprisedb.com


RE: POC: postgres_fdw insert batching

2021-01-14 Thread tsunakawa.ta...@fujitsu.com
From: Tomas Vondra 
> Attached is v9 with all of those tweaks, except for moving the BatchSize call 
> to
> BeginForeignModify - I tried that, but it did not seem like an improvement,
> because we'd still need the checks for API callbacks in ExecInsert for 
> example.
> So I decided not to do that.

Thanks, Tomas-san.  The patch looks good again.

Amit-san, thank you for teaching us about es_tuple_routing_result_relations and 
es_opened_result_relations.


Regards
Takayuki Tsunakawa



Re: O(n^2) system calls in RemoveOldXlogFiles()

2021-01-14 Thread Michael Paquier
On Fri, Jan 15, 2021 at 03:25:24PM +1300, Thomas Munro wrote:
> Thanks Michael!  Another notch for the unnecessary system call
> hitlist: https://wiki.postgresql.org/wiki/Syscall_Reduction

A quick question.  How much does it matter in terms of
micro-performance for this code path depending on max/min_wal_size?
Andres has mentioned his AIO work, without giving any numbers.
--
Michael


signature.asc
Description: PGP signature


Re: pg_preadv() and pg_pwritev()

2021-01-14 Thread Sergey Shinderuk

On 15.01.2021 04:45, Tom Lane wrote:

Hence, I propose the attached.  This works as far as I can tell
to fix the problem you're seeing.

Yes, it works fine. Thank you very much.




Re: O(n^2) system calls in RemoveOldXlogFiles()

2021-01-14 Thread Thomas Munro
On Fri, Jan 15, 2021 at 3:07 PM Michael Paquier  wrote:
> On Wed, Jan 13, 2021 at 04:27:25PM +0900, Michael Paquier wrote:
> > I have been looking again at that, and the rebased version that Andres
> > has provided would take care of that.  Any thoughts?
>
> Hearing nothing, I have applied the thing on HEAD after more tests and
> more reads of the code.

Thanks Michael!  Another notch for the unnecessary system call
hitlist: https://wiki.postgresql.org/wiki/Syscall_Reduction




Re: pg_preadv() and pg_pwritev()

2021-01-14 Thread Sergey Shinderuk

On 15.01.2021 05:04, Tom Lane wrote:


on her machine there's no detail at all:

% xcrun --verbose --no-cache --show-sdk-path
/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk



The same on my machine. I get details for --find, but not for 
--show-sdk-path.




So I'm not sure what to make of that.  But I'm hesitant to assume
that xcrun is just a wrapper around xcodebuild.



I agree.




Re: O(n^2) system calls in RemoveOldXlogFiles()

2021-01-14 Thread Michael Paquier
On Wed, Jan 13, 2021 at 04:27:25PM +0900, Michael Paquier wrote:
> I have been looking again at that, and the rebased version that Andres
> has provided would take care of that.  Any thoughts?

Hearing nothing, I have applied the thing on HEAD after more tests and
more reads of the code.
--
Michael


signature.asc
Description: PGP signature


Re: pg_preadv() and pg_pwritev()

2021-01-14 Thread Sergey Shinderuk

On 15.01.2021 04:53, Sergey Shinderuk wrote:


I see that "xcrun --sdk macosx --show-sdk-path" really calls
"/Applications/Xcode.app/Contents/Developer/usr/bin/xcodebuild -sdk 
macosx -version Path" under the hood.




... and caches the result; xcodebuild is not called without --no-cache.
So it still makes sense to fall back on xcodebuild.

% sudo dtrace -n 'pid$target::popen:entry { trace(copyinstr(arg0)) }' -c 
'xcrun --sdk macosx --show-sdk-path'

dtrace: description 'pid$target::popen:entry ' matched 1 probe
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk
dtrace: pid 26981 has exited




Re: pg_preadv() and pg_pwritev()

2021-01-14 Thread Tom Lane
Sergey Shinderuk  writes:
> I see that "xcrun --sdk macosx --show-sdk-path" really calls
> "/Applications/Xcode.app/Contents/Developer/usr/bin/xcodebuild -sdk 
> macosx -version Path" under the hood.

Hmm.  I found something odd on my wife's Mac: although on my other
machines, I get something like

$ xcrun --verbose --no-cache --show-sdk-path
xcrun: note: looking up SDK with 
'/Applications/Xcode.app/Contents/Developer/usr/bin/xcodebuild -sdk 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
 -version PlatformPath'
xcrun: note: PATH = 
'/Users/tgl/testversion/bin:/usr/local/autoconf-2.69/bin:/Users/tgl/bin:/usr/local/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:/Library/Apple/usr/bin:/Library/Tcl/bin:/opt/X11/bin'
xcrun: note: SDKROOT = 
'/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk'
xcrun: note: TOOLCHAINS = ''
xcrun: note: DEVELOPER_DIR = '/Applications/Xcode.app/Contents/Developer'
xcrun: note: XCODE_DEVELOPER_USR_PATH = ''
xcrun: note: xcrun_db = 
'/var/folders/3p/2bnrmypd17jcqbtzw79t9blwgn/T/xcrun_db'
xcrun: note: lookup resolved to: 
'/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform'
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk

on her machine there's no detail at all:

% xcrun --verbose --no-cache --show-sdk-path
/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk

So I'm not sure what to make of that.  But I'm hesitant to assume
that xcrun is just a wrapper around xcodebuild.

regards, tom lane




Re: pg_preadv() and pg_pwritev()

2021-01-14 Thread Sergey Shinderuk

On 15.01.2021 01:13, Tom Lane wrote:


than relying entirely on xcodebuild.  Maybe there's a case for trying
"xcrun --sdk macosx --show-sdk-path" in between; in my tests that
seemed noticeably faster than invoking xcodebuild, and I've not yet
seen a case where it gave a different answer.



I see that "xcrun --sdk macosx --show-sdk-path" really calls
"/Applications/Xcode.app/Contents/Developer/usr/bin/xcodebuild -sdk 
macosx -version Path" under the hood.



% lldb -- xcrun --no-cache --sdk macosx --show-sdk-path
(lldb) target create "xcrun"
Current executable set to 'xcrun' (x86_64).
(lldb) settings set -- target.run-args  "--no-cache" "--sdk" "macosx" 
"--show-sdk-path"

(lldb) settings set target.unset-env-vars SDKROOT
(lldb) b popen
Breakpoint 1: where = libsystem_c.dylib`popen, address = 0x7fff67265607
(lldb) r
Process 26857 launched: '/usr/bin/xcrun' (x86_64)
Process 26857 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
frame #0: 0x7fff6e313607 libsystem_c.dylib`popen
libsystem_c.dylib`popen:
->  0x7fff6e313607 <+0>: pushq  %rbp
0x7fff6e313608 <+1>: movq   %rsp, %rbp
0x7fff6e31360b <+4>: pushq  %r15
0x7fff6e31360d <+6>: pushq  %r14
Target 0: (xcrun) stopped.
(lldb) p (char *)$arg1
(char *) $1 = 0x000100406960 
"/Applications/Xcode.app/Contents/Developer/usr/bin/xcodebuild -sdk 
macosx -version Path"



% sudo dtrace -n 'pid$target::popen:entry { trace(copyinstr(arg0)) }' -c 
'xcrun --sdk macosx --show-sdk-path'

dtrace: description 'pid$target::popen:entry ' matched 1 probe
CPU IDFUNCTION:NAME
  0 413269  popen:entry 
/Applications/Xcode.app/Contents/Developer/usr/bin/xcodebuild -sdk 
macosx -version Path

/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk
dtrace: pid 26905 has exited




Re: pg_preadv() and pg_pwritev()

2021-01-14 Thread Tom Lane
Sergey Shinderuk  writes:
> On 15.01.2021 01:13, Tom Lane wrote:
>> Also, after re-reading [1] I am not at all excited about trying to
>> remove the -isysroot switches from our *FLAGS.  What I propose to do
>> is keep that, but improve our mechanism for choosing a default value
>> for PG_SYSROOT.  It looks like first trying "xcrun --show-sdk-path",
>> and falling back to "xcodebuild -version -sdk macosx Path" if that
>> doesn't yield a valid path, is more likely to give a working build
>> than relying entirely on xcodebuild.  Maybe there's a case for trying
>> "xcrun --sdk macosx --show-sdk-path" in between; in my tests that
>> seemed noticeably faster than invoking xcodebuild, and I've not yet
>> seen a case where it gave a different answer.

> I spent quite some time trying to understand / reverse engineer the 
> logic behind xcrun's default SDK selection.

Yeah, I wasted a fair amount of time on that too, going so far as
to ktrace xcrun (as I gather you did too).  I'm not any more
enlightened than you are about exactly how it's making the choice.

> Oh, that's weird! Nevertheless I like you suggestion to call "xcrun" 
> from "configure".

Anyway, after re-reading the previous thread, something I like about
the current behavior is that it tends to produce a version-numbered
sysroot path, ie something ending in "MacOSX11.1.sdk" or whatever.
One of the hazards we're trying to avoid is some parts of a PG
installation being built against one SDK version while other parts are
built against another.  The typical behavior of "xcrun --show-sdk-path"
seems to be to produce a path ending in "MacOSX.sdk", which defeats that.
So I think we should accept the path only if it contains a version number,
and otherwise move on to the other probe commands.

Hence, I propose the attached.  This works as far as I can tell
to fix the problem you're seeing.

regards, tom lane

diff --git a/src/template/darwin b/src/template/darwin
index 32414d21a9..0c890482fe 100644
--- a/src/template/darwin
+++ b/src/template/darwin
@@ -3,11 +3,26 @@
 # Note: Darwin is the original code name for macOS, also known as OS X.
 # We still use "darwin" as the port name, partly because config.guess does.
 
-# Select where system include files should be sought.
+# Select where system include files should be sought, if user didn't say.
 if test x"$PG_SYSROOT" = x"" ; then
-  PG_SYSROOT=`xcodebuild -version -sdk macosx Path 2>/dev/null`
+  # This is far more complicated than it ought to be.  We first ask
+  # "xcrun --show-sdk-path", which seems to match the default -isysroot
+  # setting of Apple's compilers.  However, that may produce no result or
+  # a result that is not version-specific (i.e., just ".../SDKs/MacOSX.sdk").
+  # Having a version-specific sysroot seems desirable, so if there are not
+  # digits in the result string, try "xcrun --sdk macosx --show-sdk-path";
+  # and if that still doesn't work, fall back to asking xcodebuild.
+  PG_SYSROOT=`xcrun --show-sdk-path 2>/dev/null`
+  if expr x"$PG_SYSROOT" : '.*[0-9]\.[0-9]' >/dev/null ; then : okay
+  else
+PG_SYSROOT=`xcrun --sdk macosx --show-sdk-path 2>/dev/null`
+if expr x"$PG_SYSROOT" : '.*[0-9]\.[0-9]' >/dev/null ; then : okay
+else
+  PG_SYSROOT=`xcodebuild -version -sdk macosx Path 2>/dev/null`
+fi
+  fi
 fi
-# Old xcodebuild versions may produce garbage, so validate the result.
+# Validate the result: if it doesn't point at a directory, ignore it.
 if test x"$PG_SYSROOT" != x"" ; then
   if test -d "$PG_SYSROOT" ; then
 CPPFLAGS="-isysroot $PG_SYSROOT $CPPFLAGS"


Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-14 Thread Fujii Masao




On 2021/01/14 20:36, Bharath Rupireddy wrote:

On Thu, Jan 14, 2021 at 3:52 PM Fujii Masao  wrote:

-   if (!HeapTupleIsValid(tup))
-   elog(ERROR, "cache lookup failed for user mapping %u", 
entry->key);
-   umform = (Form_pg_user_mapping) GETSTRUCT(tup);
-   server = GetForeignServer(umform->umserver);
-   ReleaseSysCache(tup);
+   server = GetForeignServer(entry->serverid);

What about applying only the change about serverid, as a separate patch at
first? This change itself is helpful to get rid of error "cache lookup failed"
in pgfdw_reject_incomplete_xact_state_change(). Patch attached.


Right, we can get rid of the "cache lookup failed for user mapping"
error and also storing server oid in the cache entry is helpful for
the new functions we are going to introduce.

serverid_v1.patch looks good to me. Both make check and make
check-world passes on my system.


Thanks for the check! I pushed the patch.
 

I will respond to other comments soon.


Thanks!

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pg_preadv() and pg_pwritev()

2021-01-14 Thread Sergey Shinderuk

On 14.01.2021 21:05, Tom Lane wrote:

After considerable playing around, I'm guessing that the reason
-no_weak_imports doesn't help is that it rejects calls that are
marked as weak references on the *calling* side.  Since AC_CHECK_FUNCS
doesn't bother to #include the relevant header file, the compiler
doesn't know that preadv() ought to be marked as a weak reference.
Then, when the test program gets linked against the stub libc that's
provided by the SDK, there is a version of preadv() there so no link
failure occurs.  (There are way more moving parts in this weak-reference
thing than I'd realized.)



Oh, that's interesting. I've just played with it a bit and it looks 
exactly as you say.



Another thing I've been realizing while poking at this is that we
might not need to set -isysroot explicitly at all, which would then
lead to the compiler using its default sysroot automatically.
In some experimentation, it seems like what we need PG_SYSROOT for
is just for configure to be able to find tclConfig.sh and the Perl
header files.  So at this point I'm tempted to try ripping that
out altogether.  If you remove the lines in src/template/darwin
that inject PG_SYSROOT into CPPFLAGS and LDFLAGS, do things
work for you?


Yes, it works fine.




Re: vacuum_cost_page_miss default value and modern hardware

2021-01-14 Thread Peter Geoghegan
On Thu, Jan 14, 2021 at 10:42 AM Robert Haas  wrote:
> There are also users I've seen get hosed by vacuuming too
> aggressively. I have seen this happen in two ways. One is too much
> dirty data. The other is too much read I/O, pushing hot data out of
> the cache, leading to a storm of random I/O later when the foreground
> workload needs to get that stuff back, basically killing the system,
> sometimes for hours, while it tries to get back the stuff it lost.

> TL;DR: This change is fine with me, but this whole system has much
> deeper issues.

It seems like there is a good chance the customers of yours that
complained about the read I/O (which was not accompanied by dirtying)
were really bothered by all of their indexes being read by VACUUM. The
freeze map is probably quite effective as far as that goes, but the
burden of index vacuuming is what they tend to notice. This is perhaps
made worse by the sudden infrequent nature of the index vacuuming
against a big append-only or append-mostly table. I imagine that the
problem here is that we're doing index vacuuming when we shouldn't be
-- these customers basically had it right. Their intuition that this
is unnecessary is in fact the right one. How can it be okay to vacuum
an index when the table only has 10 dead tuples (just to freeze some
pages at the end of the table)? That's ridiculous. And it has nothing
to do with these settings. (Even if I'm wrong to suggest that that was
what it was, I think that the details and nuance of what actually
happened is likely to be important.)

We should be avoiding index vacuuming in many more cases. If there are
only a tiny number of garbage index tuples, then we really shouldn't
bother (once again, please feel free to weigh in on Masahiko's patch
over on the "New IndexAM API controlling index vacuum strategies"
thread -- that's very interesting work IMV). Bottom-up index deletion
creates far more opportunities for this kind of stuff to naturally
occur. It will now do ~99.9% of garbage tuple cleanup in indexes that
naturally use it all the time. We can expect that intuitions held by
DBAs that have experience with other RDBMSs will start to have more
predictive power when they think about Postgres and VACUUM, which
seems like a very good thing (and something that we can and should
continue to build on). Roughly speaking, we ought to look for more and
more ways to make the physical representation of the data closer to
the logical contents of the database (that's what these DBAs start
with, that's where the intuitions seem to start with, which actually
makes perfect sense).

Now back to the main topic, the GUC's default value. I believe that
your experiences here (the experiences in both directions) are
representative -- I think I've heard of all that myself. Like you, I
think that insufficient vacuuming is much more common than excessive
vacuuming. You do still have some cases where an excessive amount of
I/O from VACUUM (without any dirtying) is the issue (or at least
*seems* to be the issue, per my aside). I think that I have a high
level theory that is consistent with what you say and may suggest a
better direction for us, but it's tricky. I'll try to resolve the
apparent contradictions in my own arguments as I go (I'm a little
burnt out at the moment, so please indulge me).

I think that The Real Problem is *not* that it's too hard to tune this
stuff as a practical matter, exactly. The entire premise of these
costing parameters is that the DBA can and should make a trade-off
between query response time/workload throughput and vacuuming, as if
these two things were separate constituents that are simply unrelated.
That sounds completely wrong to me. It sounds so wrong that I can't go
down that mental path for more than 5 seconds without giving up on it.
Are we really expected to believe that in general VACUUM probably has
all the time in the world, and so should proceed at a leisurely pace?
It's almost as if the original designer imagined that the IT
department should be made to wait on the result of one of those
"VACUUM bigtable;" reports that they seem to be so keen on (other
queries are running that deliver real business value, after all). I'm
only half-joking here -- anybody reading this should now take a moment
to actively consider just how little sense any of this makes. It's so
completely and implausibly wrong that it seems likely to actually be
slightly right, if only by mistake.

There seems to be one important way in which the cost parameter design
is accidentally useful: the page dirtying stuff probably works
reasonably well. It really does make sense to throttle VACUUM in
response to dirtying pages, optimistically assuming that VACUUM will
eventually catch up. That part makes sense precisely because it seems
like it treats VACUUM as a thing that is directly tied to the workload
(an accidental happy exception to the bogus general rule for the
costing stuff). Of course, this optimism does not work out because 

Re: vacuum_cost_page_miss default value and modern hardware

2021-01-14 Thread Peter Geoghegan
On Thu, Jan 14, 2021 at 9:29 AM Magnus Hagander  wrote:
> Do you have any actual metrics behind specifically choosing the value
> 3? Or is that off a gut feeling?

I have no metrics, exactly, but I'm sure that the trend I mentioned
about page cleaning/dirtying being the bottleneck more and more these
days is true. This trend is very apparent to all of us, it seems, so
I am sure that I basically have the right idea here. I'm a little
concerned that it should actually be lowered to 2.

With that said, I don't actually accept what seems to be the original
premise of these GUCs, so I am not interested in using that to justify
changing the vacuum_cost_page_miss default. The premise seems to be:
VACUUM's behavior is determined by treating it as an optimization
problem, so all you as the DBA need to do is characterize the cost of
each kind of elementary operation using the GUCs -- the dynamic
algorithm will do the rest. What algorithm might that be, though? This
is not the optimizer, and there is no scope to come up with a cheaper
plan for VACUUM. Why not throttle longer running queries instead, or
as well?

More on the first principles of the costing stuff in a bit, when I
respond to Robert...

--
Peter Geoghegan




Re: pg_preadv() and pg_pwritev()

2021-01-14 Thread Sergey Shinderuk

On 15.01.2021 01:13, Tom Lane wrote:

I borrowed my wife's Mac, which is still on Catalina and up to now
never had Xcode on it, and found some very interesting things.

Step 1: download/install Xcode 12.3, open it, agree to license,
wait for it to finish "installing components".

At this point, /Library/Developer/CommandLineTools doesn't exist,
and we have the following outputs from various probe commands:

% xcrun --show-sdk-path
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
% xcrun --sdk macosx --show-sdk-path
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk
% xcodebuild -version -sdk macosx Path
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk

Also, cc -v reports
   -isysroot 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk

Unsurprisingly, Xcode 12.3 itself only contains

% ls -l 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs
total 0
drwxr-xr-x  5 root  wheel  160 Nov 30 07:27 DriverKit20.2.sdk
drwxr-xr-x  7 root  wheel  224 Nov 30 07:27 MacOSX.sdk
lrwxr-xr-x  1 root  wheel   10 Jan 14 15:57 MacOSX11.1.sdk -> MacOSX.sdk

Step 2: install command line tools (I used "xcode-select --install"
to fire this off, rather than the Xcode menu item).

Now I have

% ls -l /Library/Developer/CommandLineTools/SDKs
total 0
lrwxr-xr-x  1 root  wheel   14 Jan 14 16:42 MacOSX.sdk -> MacOSX11.1.sdk
drwxr-xr-x  8 root  wheel  256 Jul  9  2020 MacOSX10.15.sdk
drwxr-xr-x  7 root  wheel  224 Nov 30 07:33 MacOSX11.1.sdk

which is pretty interesting in itself, because the same directory on
my recently-updated-to-Big-Sur Macs does NOT have the 11.1 SDK.
I wonder what determines which versions get installed here.

More interesting yet:

% xcrun --show-sdk-path
/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk
% xcrun --sdk macosx --show-sdk-path
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk
% xcodebuild -version -sdk macosx Path
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk

and cc -v reports
   -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk

So apparently, "xcrun --show-sdk-path" (without any -sdk option)
is the most authoritative guide to the compiler's default sysroot.

However, googling turns up various people reporting that "xcrun
--show-sdk-path" returns an empty string for them, and our last
major investigation into this [1] found that there are some system
states where the compiler appears to have no default sysroot,
which I bet is the same thing.  I do not at this point have a recipe
to reproduce such a state, but we'd be fools to imagine it's no
longer possible.  My guess about it is that Apple's processes for
updating the default sysroot during system updates are just plain
buggy, with various corner cases that have ill-understood causes.

Also, after re-reading [1] I am not at all excited about trying to
remove the -isysroot switches from our *FLAGS.  What I propose to do
is keep that, but improve our mechanism for choosing a default value
for PG_SYSROOT.  It looks like first trying "xcrun --show-sdk-path",
and falling back to "xcodebuild -version -sdk macosx Path" if that
doesn't yield a valid path, is more likely to give a working build
than relying entirely on xcodebuild.  Maybe there's a case for trying
"xcrun --sdk macosx --show-sdk-path" in between; in my tests that
seemed noticeably faster than invoking xcodebuild, and I've not yet
seen a case where it gave a different answer.

Thoughts?

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/20840.1537850987%40sss.pgh.pa.us



Thanks for the thorough investigation, and sorry for the late reply.

I spent quite some time trying to understand / reverse engineer the 
logic behind xcrun's default SDK selection. Apparently, "man xcrun" is 
not accurate when it says:


	The  SDK  which  will  be searched defaults to the most recent 
available...


I didn't find anything really useful or helpful. 
"/Library/Developer/CommandLineTools" is hardcoded into 
"libxcrun.dylib". On my machine xcrun scans


/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs

and

/Library/Developer/CommandLineTools/SDKs

in that order, and loads "SDKSettings.plist" from each subdirectory. I 
looked into plists, but couldn't find anything special about 
"MacOSX10.15.sdk".



Okay, here is what I have:

% ls -l 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs

total 0
drwxr-xr-x  5 root  wheel  160 Nov 30 15:27 DriverKit20.2.sdk
drwxr-xr-x  7 root  wheel  224 Nov 30 15:27 MacOSX.sdk
lrwxr-xr-x  1 root  wheel   10 Dec 17 14:25 MacOSX11.1.sdk -> MacOSX.sdk

% ls -l /Library/Developer/CommandLineTools/SDKs
total 0
lrwxr-xr-x  1 root  wheel   14 

Re: Add table access method as an option to pgbench

2021-01-14 Thread David Zhang

On 2021-01-09 5:44 a.m., youichi aramaki wrote:


+  "   create tables with using specified 
table access method,\n"

In my opinion,  this sentence should be "create tables with specified table access 
method" or "create tables using specified table access method".
"create tables with specified table access method" may be more consistent with 
other options.


Thank you, Youichi. I will change it to "create tables with specified 
table access method" in the next patch.


--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca




Re: Transactions involving multiple postgres foreign servers, take 2

2021-01-14 Thread Zhihong Yu
For v32-0002-postgres_fdw-supports-commit-and-rollback-APIs.patch :

+   entry->changing_xact_state = true;
...
+   entry->changing_xact_state = abort_cleanup_failure;

I don't see a return statement in between the two assignments. I wonder
why entry->changing_xact_state is set to true and then assigned
again later.

For v32-0007-Introduce-foreign-transaction-launcher-and-resol.patch :

bq. This commits introduces to new background processes: foreign

commits introduces to new -> commit introduces two new

+FdwXactExistsXid(TransactionId xid)

Since Xid is the parameter to this method, I think the Xid suffix can be
dropped from the method name.

+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group

Please correct year in the next patch set.

+FdwXactLauncherRequestToLaunch(void)

Since the launcher's job is to 'launch', I think the Launcher can be
omitted from the method name.

+/* Report shared memory space needed by FdwXactRsoverShmemInit */
+Size
+FdwXactRslvShmemSize(void)

Are both Rsover and Rslv referring to resolver? It would be better to use
the whole word, which reduces confusion.
Plus, FdwXactRsoverShmemInit should be FdwXactRslvShmemInit (or
FdwXactResolveShmemInit)

+fdwxact_launch_resolver(Oid dbid)

The above method is not in camel case. It would be better if method names
are consistent (in casing).

+errmsg("out of foreign transaction resolver slots"),
+errhint("You might need to increase
max_foreign_transaction_resolvers.")));

It would be nice to include the value of max_foreign_xact_resolvers

For fdwxact_resolver_onexit():

+   LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+   fdwxact->locking_backend = InvalidBackendId;
+   LWLockRelease(FdwXactLock);

There is no call inside the for loop to any method that may take a long
time. I wonder if the lock can be obtained prior to the for loop and
released after coming out of the for loop.
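
Something along these lines, roughly (the array and field names below are my
guesses based on the snippets above, not necessarily what the patch uses):

    /* assuming FdwXactCtl, FdwXact, and locking_backend come from the patch */
    LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
    for (i = 0; i < FdwXactCtl->num_fdwxacts; i++)
    {
        FdwXact     fdwxact = FdwXactCtl->fdwxacts[i];

        /* release only the entries this backend is holding */
        if (fdwxact->locking_backend == MyBackendId)
            fdwxact->locking_backend = InvalidBackendId;
    }
    LWLockRelease(FdwXactLock);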

+FXRslvLoop(void)

Please use Resolver instead of Rslv

+   FdwXactResolveFdwXacts(held_fdwxacts, nheld);

Fdw and Xact are repeated twice each in the method name. Probably the
method name can be made shorter.

Cheers

On Thu, Jan 14, 2021 at 11:04 AM Zhihong Yu  wrote:

> Hi,
> For v32-0008-Prepare-foreign-transactions-at-commit-time.patch :
>
> +   bool        have_notwophase = false;
>
> Maybe name the variable have_no_twophase so that it is easier to read.
>
> +* Two-phase commit is not required if the number of servers performed
>
> performed -> performing
>
> +errmsg("cannot process a distributed transaction that has
> operated on a foreign server that does not support two-phase commit
> protocol"),
> +errdetail("foreign_twophase_commit is \'required\' but
> the transaction has some foreign servers which are not capable of two-phase
> commit")));
>
> The lines are really long. Please wrap into more lines.
>
>
>
> On Wed, Jan 13, 2021 at 9:50 PM Masahiko Sawada 
> wrote:
>
>> On Thu, Jan 7, 2021 at 11:44 AM Zhihong Yu  wrote:
>> >
>> > Hi,
>>
>> Thank you for reviewing the patch!
>>
>> > For pg-foreign/v31-0004-Add-PrepareForeignTransaction-API.patch :
>> >
>> > However these functions are not neither committed nor aborted at
>> >
>> > I think the double negation was not intentional. Should be 'are neither
>> ...'
>>
>> Fixed.
>>
>> >
>> > For FdwXactShmemSize(), is another MAXALIGN(size) needed prior to the
>> return statement ?
>>
>> Hmm, you mean that we need MAXALIGN(size) after adding the size of
>> FdwXactData structs?
>>
>> Size
>> FdwXactShmemSize(void)
>> {
>> Size        size;
>>
>> /* Size for foreign transaction information array */
>> size = offsetof(FdwXactCtlData, fdwxacts);
>> size = add_size(size, mul_size(max_prepared_foreign_xacts,
>>sizeof(FdwXact)));
>> size = MAXALIGN(size);
>> size = add_size(size, mul_size(max_prepared_foreign_xacts,
>>sizeof(FdwXactData)));
>>
>> return size;
>> }
>>
>> I don't think we need to do that. Looking at other similar code such
>> as TwoPhaseShmemSize() doesn't do that. Why do you think we need that?
>>
>> >
>> > +   fdwxact = FdwXactInsertFdwXactEntry(xid, fdw_part);
>> >
>> > For the function name, Fdw and Xact appear twice, each. Maybe one of
>> them can be dropped ?
>>
>> Agreed. Changed to FdwXactInsertEntry().
>>
>> >
>> > +* we don't need to anything for this participant because all
>> foreign
>> >
>> > 'need to' -> 'need to do'
>>
>> Fixed.
>>
>> >
>> > +   else if (TransactionIdDidAbort(xid))
>> > +   return FDWXACT_STATUS_ABORTING;
>> > +
>> > the 'else' can be omitted since the preceding if would return.
>>
>> Fixed.
>>
>> >
>> > +   if (max_prepared_foreign_xacts <= 0)
>> >
>> > I wonder when the value for max_prepared_foreign_xacts would be
>> negative (and whether that should be considered an error).
>> >
>>
>> Fixed to (max_prepared_foreign_xacts == 0)
>>
>> Attached the updated version patch 

Re: WIP: System Versioned Temporal Table

2021-01-14 Thread Ryan Lambert
On Thu, Jan 14, 2021 at 2:22 PM Simon Riggs 
wrote:

> On Thu, Jan 14, 2021 at 5:46 PM Surafel Temesgen 
> wrote:
>
> > On Fri, Jan 8, 2021 at 7:50 PM Ryan Lambert 
> wrote:
> >>
> >> I prefer to have them hidden by default.  This was mentioned up-thread
> with no decision, it seems the standard is ambiguous.  MS SQL appears to
> have flip-flopped on this decision [1].
>
> I think the default should be like this:
>
> SELECT * FROM foo FOR SYSTEM_TIME AS OF ...
> should NOT include the Start and End timestamp columns
> because this acts like a normal query just with a different snapshot
> timestamp
>
> SELECT * FROM foo FOR SYSTEM_TIME BETWEEN x AND y
> SHOULD include the Start and End timestamp columns
> since this form of query can include multiple row versions for the
> same row, so it makes sense to see the validity times
>

+1

Ryan Lambert


Re: pg_preadv() and pg_pwritev()

2021-01-14 Thread Tom Lane
I borrowed my wife's Mac, which is still on Catalina and up to now
never had Xcode on it, and found some very interesting things.

Step 1: download/install Xcode 12.3, open it, agree to license,
wait for it to finish "installing components".

At this point, /Library/Developer/CommandLineTools doesn't exist,
and we have the following outputs from various probe commands:

% xcrun --show-sdk-path 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk
% xcrun --sdk macosx --show-sdk-path 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk
% xcodebuild -version -sdk macosx Path
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk

Also, cc -v reports
  -isysroot 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk

Unsurprisingly, Xcode 12.3 itself only contains

% ls -l 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs
total 0
drwxr-xr-x  5 root  wheel  160 Nov 30 07:27 DriverKit20.2.sdk
drwxr-xr-x  7 root  wheel  224 Nov 30 07:27 MacOSX.sdk
lrwxr-xr-x  1 root  wheel   10 Jan 14 15:57 MacOSX11.1.sdk -> MacOSX.sdk

Step 2: install command line tools (I used "xcode-select --install"
to fire this off, rather than the Xcode menu item).

Now I have

% ls -l /Library/Developer/CommandLineTools/SDKs
total 0
lrwxr-xr-x  1 root  wheel   14 Jan 14 16:42 MacOSX.sdk -> MacOSX11.1.sdk
drwxr-xr-x  8 root  wheel  256 Jul  9  2020 MacOSX10.15.sdk
drwxr-xr-x  7 root  wheel  224 Nov 30 07:33 MacOSX11.1.sdk

which is pretty interesting in itself, because the same directory on
my recently-updated-to-Big-Sur Macs does NOT have the 11.1 SDK.
I wonder what determines which versions get installed here.

More interesting yet:

% xcrun --show-sdk-path 
/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk
% xcrun --sdk macosx --show-sdk-path  
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk
% xcodebuild -version -sdk macosx Path  
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk

and cc -v reports
  -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk

So apparently, "xcrun --show-sdk-path" (without any -sdk option)
is the most authoritative guide to the compiler's default sysroot.

However, googling turns up various people reporting that "xcrun
--show-sdk-path" returns an empty string for them, and our last
major investigation into this [1] found that there are some system
states where the compiler appears to have no default sysroot,
which I bet is the same thing.  I do not at this point have a recipe
to reproduce such a state, but we'd be fools to imagine it's no
longer possible.  My guess about it is that Apple's processes for
updating the default sysroot during system updates are just plain
buggy, with various corner cases that have ill-understood causes.

Also, after re-reading [1] I am not at all excited about trying to
remove the -isysroot switches from our *FLAGS.  What I propose to do
is keep that, but improve our mechanism for choosing a default value
for PG_SYSROOT.  It looks like first trying "xcrun --show-sdk-path",
and falling back to "xcodebuild -version -sdk macosx Path" if that
doesn't yield a valid path, is more likely to give a working build
than relying entirely on xcodebuild.  Maybe there's a case for trying
"xcrun --sdk macosx --show-sdk-path" in between; in my tests that
seemed noticeably faster than invoking xcodebuild, and I've not yet
seen a case where it gave a different answer.

Thoughts?

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/20840.1537850987%40sss.pgh.pa.us




Re: WIP: System Versioned Temporal Table

2021-01-14 Thread Simon Riggs
On Thu, Jan 14, 2021 at 5:42 PM Surafel Temesgen  wrote:
>
> Hi Simon,
> Thank you for all the work you do

No problem.

> On Mon, Jan 11, 2021 at 5:02 PM Simon Riggs  
> wrote:
>>
>>
>>
>> * Anomalies around use of CURRENT_TIMESTAMP are not discussed or resolved.
>> Probably need to add a test that end_timestamp > start_timestamp or ERROR,
>> which effectively enforces serializability.
>>
>
>
> This scenario doesn't happen.

Yes, I think it can. The current situation is that the Start or End is
set to the Transaction Start Timestamp.
So if t2 starts before t1, then if t1 creates a row and t2 deletes it
then we will have start=t1 end=t2, but t2 < t1.

> There is no possibility of a record being deleted or updated before it is inserted

Agreed, but that was not the point.

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Re: WIP: System Versioned Temporal Table

2021-01-14 Thread Simon Riggs
On Thu, Jan 14, 2021 at 5:46 PM Surafel Temesgen  wrote:

> On Fri, Jan 8, 2021 at 7:50 PM Ryan Lambert  wrote:
>>
>> I prefer to have them hidden by default.  This was mentioned up-thread with 
>> no decision, it seems the standard is ambiguous.  MS SQL appears to have 
>> flip-flopped on this decision [1].

I think the default should be like this:

SELECT * FROM foo FOR SYSTEM_TIME AS OF ...
should NOT include the Start and End timestamp columns
because this acts like a normal query just with a different snapshot timestamp

SELECT * FROM foo FOR SYSTEM_TIME BETWEEN x AND y
SHOULD include the Start and End timestamp columns
since this form of query can include multiple row versions for the
same row, so it makes sense to see the validity times

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Re: Cirrus CI (Windows help wanted)

2021-01-14 Thread Thomas Munro
On Wed, Jan 13, 2021 at 3:04 AM Andrew Dunstan  wrote:
> OK, I got this working.

Thanks Andrew.  This is great!

> # cirrus does something odd with this command, so it's stuck in a bat file
> # and copied to the docker container and then executed
> COPY ci/inst-tools.bat .
> RUN \
> cmd /c .\inst-tools.bat

Huh, weird.  Must have been painful to figure out.  The Docker image
step took 42 minutes for me so I shudder to think how the
trial-and-error process went.  But now that it's working and cached,
build jobs start up nice and fast, so that's some great progress.

I'm experimenting with this on my own development branches for now,
and will see what else I can improve.  Then maybe I'll change the
cfbot over after the commitfest.  (Also got to get macOS doing
check-world, and Linux using a fast-start Docker image.)




Re: new heapcheck contrib module

2021-01-14 Thread Robert Haas
On Mon, Jan 11, 2021 at 1:16 PM Mark Dilger
 wrote:
> Added in v32, along with adding pg_amcheck to @contrib_uselibpq, 
> @contrib_uselibpgport, and @contrib_uselibpgcommon

exit_utils.c fails to achieve the goal of making this code independent
of pg_dump, because of:

#ifdef WIN32
if (parallel_init_done && GetCurrentThreadId() != mainThreadId)
_endthreadex(code);
#endif

parallel_init_done is a pg_dump-ism. Perhaps this chunk of code could
be a handler that gets registered using exit_nicely() rather than
hard-coded like this. Note that the function comments for
exit_nicely() are heavily implicated in this problem, since they also
apply to stuff that only happens in pg_dump and not other utilities.
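
For instance, something like this (assuming pg_dump's existing
on_exit_nicely() registration machinery; I haven't double-checked the exact
callback signature):

    /* parallel_init_done and mainThreadId are the pg_dump globals quoted above */
    static void
    pg_dump_thread_exit_hook(int code, void *arg)
    {
    #ifdef WIN32
        if (parallel_init_done && GetCurrentThreadId() != mainThreadId)
            _endthreadex(code);
    #endif
    }

    /* registered once during pg_dump startup, instead of hard-coding it */
    on_exit_nicely(pg_dump_thread_exit_hook, NULL);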

I'm skeptical about the idea of putting functions into string_utils.c
with names as generic as include_filter() and exclude_filter().
Existing cases like fmtId() and fmtQualifiedId() are not great either,
but I think this is worse and that we should do some renaming. On a
related note, it's not clear to me why these should be classified as
string_utils while stuff like expand_schema_name_patterns() gets
classified as option_utils. These are neither generic
string-processing functions nor are they generic options-parsing
functions. They are functions for expanding shell-glob style patterns
for database object names. And they seem like they ought to be
together, because they seem to do closely-related things. I'm open to
an argument that this is wrongheaded on my part, but it looks weird to
me the way it is.

I'm pretty unimpressed by query_utils.c. The CurrentResultHandler
stuff looks grotty, and you don't seem to really use it anywhere. And
it seems woefully overambitious to me anyway: this doesn't apply to
every kind of "result" we've got hanging around, absolutely nothing
even close to that, even though a name like CurrentResultHandler
sounds very broad. It also means more global variables, which is a
thing of which the PostgreSQL codebase already has a deplorable
oversupply. quiet_handler() and noop_handler() aren't used anywhere
either, AFAICS.

I wonder if it would be better to pass in callbacks rather than
relying on global variables. e.g.:

typedef void (*fatal_error_callback)(const char *fmt,...)
pg_attribute_printf(1, 2) pg_attribute_noreturn();

Then you could have a few helper functions that take an argument of
type fatal_error_callback and throw the right fatal error for (a)
wrong PQresultStatus() and (b) result is not one row. Do you need any
other cases? exiting_handler() seems to think that the caller might
want to allow any number of tuples, or any positive number, or any
particular cout, but I'm not sure if all of those cases are really
needed.
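
To make that concrete, here is a minimal sketch of the sort of helpers I mean
(names invented for illustration, not taken from the patch):

    /* assumes libpq-fe.h and the fatal_error_callback typedef above */
    static PGresult *
    ExecuteQueryOrDie(PGconn *conn, const char *query,
                      fatal_error_callback die)
    {
        PGresult   *res = PQexec(conn, query);

        if (PQresultStatus(res) != PGRES_TUPLES_OK)
            die("query failed: %s\nquery was: %s",
                PQerrorMessage(conn), query);
        return res;
    }

    static PGresult *
    ExecuteQueryForSingleRowOrDie(PGconn *conn, const char *query,
                                  fatal_error_callback die)
    {
        PGresult   *res = ExecuteQueryOrDie(conn, query, die);

        if (PQntuples(res) != 1)
            die("query returned %d rows instead of one: %s",
                PQntuples(res), query);
        return res;
    }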

This stuff is finicky and hard to get right. You don't really want to
create a situation where the same code keeps getting duplicated, or
the behavior's just a little bit inconsistent everywhere, but it also
isn't great to build layers upon layers of abstraction around
something like ExecuteSqlQuery which is, in the end, a four-line
function. I don't think there's any problem with something like
pg_dump having its own function to execute-a-query-or-die. Maybe that
function ends up doing something like
TheGenericFunctionToExecuteOrDie(my_die_fn, the_query), or maybe
pg_dump can just open-code it but have a my_die_fn to pass down to the
glob-expansion stuff, or, well, I don't know.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Transactions involving multiple postgres foreign servers, take 2

2021-01-14 Thread Zhihong Yu
Hi,
For v32-0008-Prepare-foreign-transactions-at-commit-time.patch :

+   bool        have_notwophase = false;

Maybe name the variable have_no_twophase so that it is easier to read.

+* Two-phase commit is not required if the number of servers performed

performed -> performing

+errmsg("cannot process a distributed transaction that has
operated on a foreign server that does not support two-phase commit
protocol"),
+errdetail("foreign_twophase_commit is \'required\' but the
transaction has some foreign servers which are not capable of two-phase
commit")));

The lines are really long. Please wrap into more lines.



On Wed, Jan 13, 2021 at 9:50 PM Masahiko Sawada 
wrote:

> On Thu, Jan 7, 2021 at 11:44 AM Zhihong Yu  wrote:
> >
> > Hi,
>
> Thank you for reviewing the patch!
>
> > For pg-foreign/v31-0004-Add-PrepareForeignTransaction-API.patch :
> >
> > However these functions are not neither committed nor aborted at
> >
> > I think the double negation was not intentional. Should be 'are neither
> ...'
>
> Fixed.
>
> >
> > For FdwXactShmemSize(), is another MAXALIGN(size) needed prior to the
> return statement ?
>
> Hmm, you mean that we need MAXALIGN(size) after adding the size of
> FdwXactData structs?
>
> Size
> FdwXactShmemSize(void)
> {
> Size        size;
>
> /* Size for foreign transaction information array */
> size = offsetof(FdwXactCtlData, fdwxacts);
> size = add_size(size, mul_size(max_prepared_foreign_xacts,
>sizeof(FdwXact)));
> size = MAXALIGN(size);
> size = add_size(size, mul_size(max_prepared_foreign_xacts,
>sizeof(FdwXactData)));
>
> return size;
> }
>
> I don't think we need to do that. Looking at other similar code such
> as TwoPhaseShmemSize() doesn't do that. Why do you think we need that?
>
> >
> > +   fdwxact = FdwXactInsertFdwXactEntry(xid, fdw_part);
> >
> > For the function name, Fdw and Xact appear twice, each. Maybe one of
> them can be dropped ?
>
> Agreed. Changed to FdwXactInsertEntry().
>
> >
> > +* we don't need to anything for this participant because all
> foreign
> >
> > 'need to' -> 'need to do'
>
> Fixed.
>
> >
> > +   else if (TransactionIdDidAbort(xid))
> > +   return FDWXACT_STATUS_ABORTING;
> > +
> > the 'else' can be omitted since the preceding if would return.
>
> Fixed.
>
> >
> > +   if (max_prepared_foreign_xacts <= 0)
> >
> > I wonder when the value for max_prepared_foreign_xacts would be negative
> (and whether that should be considered an error).
> >
>
> Fixed to (max_prepared_foreign_xacts == 0)
>
> Attached the updated version patch set.
>
> Regards,
>
> --
> Masahiko Sawada
> EnterpriseDB:  https://www.enterprisedb.com/
>


Re: adding wait_start column to pg_locks

2021-01-14 Thread Robert Haas
On Wed, Jan 13, 2021 at 10:40 PM Ian Lawrence Barwick  wrote:
> It looks like the logical place to store the value is in the PROCLOCK
> structure; ...

That seems surprising, because there's one PROCLOCK for every
combination of a process and a lock. But, a process can't be waiting
for more than one lock at the same time, because once it starts
waiting to acquire the first one, it can't do anything else, and thus
can't begin waiting for a second one. So I would have thought that
this would be recorded in the PROC.
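
A minimal sketch of what recording it in the PROC might look like (the field
name and exact placement are my guesses, not necessarily what the patch does):

    /* in struct PGPROC (src/include/storage/proc.h) */
    TimestampTz waitStart;      /* when this backend began waiting for a lock */

    /* in WaitOnLock() (src/backend/storage/lmgr/lock.c), before sleeping */
    MyProc->waitStart = GetCurrentTimestamp();

    /* ... and reset once the lock is granted or the wait is aborted */
    MyProc->waitStart = 0;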

But I haven't looked at the patch so maybe I'm dumb.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: vacuum_cost_page_miss default value and modern hardware

2021-01-14 Thread Robert Haas
On Thu, Jan 14, 2021 at 12:29 PM Magnus Hagander  wrote:
> +1 for this in principle.

I'm not opposed to this change and I agree that the relative expense
of dirtying a page is higher than what the current defaults suggest.
So I also think Peter is going in the right general direction, though
like you I am not sure about the specifics.

In practice, most users get hosed by not vacuuming aggressively
enough, rather than by vacuuming too aggressively. For instance,
suppose you have a table and workload such that the table needs to be
vacuumed once per hour to maintain good performance. As you make the
table bigger and bigger, you will eventually reach a size where the
configured cost limits aren't high enough to permit this to happen.
The system has no option to disregard the configured limit, even for
an emergency autovacuum. Eventually the user is forced into an outage
either by the table becoming so bloated that VACUUM FULL is required,
or by running out of XIDs. It seems bad that we ship a set of default
settings that are guaranteed to hose any database with a reasonable
number of updates once the database size exceeds some limit. The fact
that we decreased autovacuum_cost_delay by 10x increased the limit by
10x, which is good, but the problem remains. I don't know exactly how
to do better, and any proposal in that area would be much more
complicated than what Peter is proposing here, but it's something to
think about.

There are also users I've seen get hosed by vacuuming too
aggressively. I have seen this happen in two ways. One is too much
dirty data. The other is too much read I/O, pushing hot data out of
the cache, leading to a storm of random I/O later when the foreground
workload needs to get that stuff back, basically killing the system,
sometimes for hours, while it tries to get back the stuff it lost.
That might seem like an argument against further raising the possible
I/O rate, which would be the effect of the change Peter is proposing,
but that's not really my position. I think the bigger problem with all
this is that it's too hard to configure; almost nobody can work out
what a given set of configuration parameters actually means in MB/s or
GB/hour. In the past I've proposed that maybe we should redesign this
whole system to work in those kinds of units, which people actually
understand, but I don't know if that's the right idea. Still another
approach would be to try to reduce the degree to which the cache gets
trashed, or make it have less harmful effect on future performance by
reading things back in more efficiently. I don't really know.
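
To illustrate how opaque the units are, here is a throwaway conversion from
the cost settings to an implied maximum I/O rate. The numbers below are my
assumptions: the current defaults of vacuum_cost_limit=200,
autovacuum_vacuum_cost_delay=2ms, vacuum_cost_page_miss=10,
vacuum_cost_page_dirty=20, and 8kB pages.

    #include <stdio.h>

    int
    main(void)
    {
        double      cost_limit = 200;   /* vacuum_cost_limit */
        double      delay_ms = 2;       /* autovacuum_vacuum_cost_delay */
        double      page_miss = 10;     /* vacuum_cost_page_miss */
        double      page_dirty = 20;    /* vacuum_cost_page_dirty */
        double      blcksz = 8192;
        double      windows = 1000.0 / delay_ms;    /* delay windows per second */

        /* budget per window divided by per-page cost gives pages per window */
        printf("max read rate:  %.0f MB/s\n",
               cost_limit / page_miss * windows * blcksz / (1024 * 1024));
        printf("max dirty rate: %.0f MB/s\n",
               cost_limit / page_dirty * windows * blcksz / (1024 * 1024));
        return 0;
    }

With those defaults that works out to roughly 78 MB/s of all-miss reads and
about 39 MB/s of dirtying.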

TL;DR: This change is fine with me, but this whole system has much
deeper issues.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_preadv() and pg_pwritev()

2021-01-14 Thread Tom Lane
I wrote:
> It seems like the more productive approach would be to try to identify
> the right sysroot to use.  I wonder if there is some less messy way
> to find out the compiler's default sysroot than to scrape it out of
> -v output.

This is, of course, not terribly well documented by Apple.  But
Mr. Google suggests that "xcrun --show-sdk-path" might serve.
What does that print for you?

regards, tom lane




Re: pg_preadv() and pg_pwritev()

2021-01-14 Thread Tom Lane
Sergey Shinderuk  writes:
> On 14.01.2021 18:42, Tom Lane wrote:
>>> I noticed that "cc" invoked from command line uses:
>>> -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk

>> Hm, how did you determine that exactly?

> % cc -v tmp.c
> ...
> -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk 

Okay, interesting.  On my Catalina machine, I see

-isysroot 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk

which is also a 10.15 SDK, since I haven't upgraded Xcode past 12.0.
I wonder if that would change if I did upgrade (but I don't plan to
risk it, since this is my only remaining Catalina install).

After considerable playing around, I'm guessing that the reason
-no_weak_imports doesn't help is that it rejects calls that are
marked as weak references on the *calling* side.  Since AC_CHECK_FUNCS
doesn't bother to #include the relevant header file, the compiler
doesn't know that preadv() ought to be marked as a weak reference.
Then, when the test program gets linked against the stub libc that's
provided by the SDK, there is a version of preadv() there so no link
failure occurs.  (There are way more moving parts in this weak-reference
thing than I'd realized.)
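
(For illustration, here is roughly what a properly weak-linked call site looks
like once the header *is* included -- just a sketch, and the fallback comment
is illustrative only:)

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/uio.h>        /* declares preadv() with availability markup */

    static ssize_t
    preadv_compat(int fd, const struct iovec *iov, int iovcnt, off_t offset)
    {
        if (__builtin_available(macOS 11.0, *))
            return preadv(fd, iov, iovcnt, offset);     /* weak import */
        errno = ENOSYS;
        return -1;              /* caller would fall back to a pread() loop */
    }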

It seems like the more productive approach would be to try to identify
the right sysroot to use.  I wonder if there is some less messy way
to find out the compiler's default sysroot than to scrape it out of
-v output.

Another thing I've been realizing while poking at this is that we
might not need to set -isysroot explicitly at all, which would then
lead to the compiler using its default sysroot automatically.
In some experimentation, it seems like what we need PG_SYSROOT for
is just for configure to be able to find tclConfig.sh and the Perl
header files.  So at this point I'm tempted to try ripping that
out altogether.  If you remove the lines in src/template/darwin
that inject PG_SYSROOT into CPPFLAGS and LDFLAGS, do things
work for you?

regards, tom lane




Re: WIP: System Versioned Temporal Table

2021-01-14 Thread Surafel Temesgen
Hi Ryan

On Fri, Jan 8, 2021 at 7:50 PM Ryan Lambert  wrote:

> I prefer to have them hidden by default.  This was mentioned up-thread
> with no decision, it seems the standard is ambiguous.  MS SQL appears to
> have flip-flopped on this decision [1].
>
>
I will change it to hidden by default if there are no objections

regards
Surafel


Re: WIP: System Versioned Temporal Table

2021-01-14 Thread Surafel Temesgen
Hi Simon,
Thank you for all the work you do

On Mon, Jan 11, 2021 at 5:02 PM Simon Riggs 
wrote:

>
>
> * Anomalies around use of CURRENT_TIMESTAMP are not discussed or resolved.
> Probably need to add a test that end_timestamp > start_timestamp or ERROR,
> which effectively enforces serializability.
>
>

This scenario doesn't happen. There is no possibility of a record being
deleted or updated before it is inserted.


> * No discussion, comments or tests around freezing and whether that
> causes issues here
>
>
This feature introduces no new issues regarding freezing. Adding
documentation about the table size growth caused by the retention of old
records seems enough to me.


>
> * ALTER TABLE needs some work, it's a bit klugey at the moment and
> needs extra tests.
> Should refuse DROP COLUMN on a system time column, but currently doesn't
>
> * UPDATE foo SET start_timestamp = DEFAULT should fail but currently
> doesn't
>
>
Okay, I will fix it.

regards
Surafel


Re: vacuum_cost_page_miss default value and modern hardware

2021-01-14 Thread Magnus Hagander
On Thu, Jan 14, 2021 at 1:24 AM Peter Geoghegan  wrote:
>
> On Wed, Jan 6, 2021 at 7:43 PM Peter Geoghegan  wrote:
> > More concretely, we could perhaps lower vacuum_cost_page_miss to 5. It
> > has had the value as 10 as its default since 2004 (just like
> > vacuum_cost_page_dirty, whose default has also not been changed since
> > the start). These defaults were decided in a time when nbtree VACUUM
> > could do lots of random I/O, there was no visibility map, etc. So this
> > refresh is not just about hardware.
>
> Attached patch lowers vacuum_cost_page_miss to 3. I think that this
> change in the default is both likely to be helpful in medium to large
> installations, and unlikely to cause harm in small installations. If
> I/O for reads made by VACUUM is naturally very slow (even in the
> common case where it's entirely sequential), then that will naturally
> provide additional throttling.

+1 for this in principle.

Do you have any actual metrics behind specifically choosing the value
3? Or is that off a gut feeling?


-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/




Re: [HACKERS] [PATCH] Generic type subscripting

2021-01-14 Thread Dian M Fay
On Thu Jan 14, 2021 at 10:04 AM EST, Dmitry Dolgov wrote:
> > On Tue, Jan 12, 2021 at 08:02:59PM +0100, Pavel Stehule wrote:
> > On Sun, Jan 10, 2021 at 19:52, Pavel Stehule wrote:
> >
> > I tested behaviour and I didn't find anything other than the mentioned
> > issue.
> >
> > Now I can check this feature from plpgsql, and it is working. Because there
> > is no special support in plpgsql runtime, the update of jsonb is
> > significantly slower than the update of arrays, and it looks like the update
> > of jsonb has O(N^2) cost. I don't think it is important at this moment - more
> > important is the fact that I didn't find any memory problems.
>
> Thanks for testing. Regarding updates when the structure doesn't match the
> provided path, as I've mentioned I don't have strong preferences, but on
> second thought I'm probably more inclined toward returning an error in this
> case. Since there are pros and cons for both suggestions, it could be
> decided by vote majority between the no-update (Dian) or error (Pavel,
> me) options. Any +1 to one of the options from others?
>
> Other than that, since I've already posted the patch for returning an
> error option, it seems that the only thing left is to decide with which
> version to go.

The trigger issue (which I did verify) makes the "no update" option
unworkable imo, JavaScript's behavior notwithstanding. But it should be
called out very clearly in the documentation, since it does depart from
what people more familiar with that behavior may expect. Here's a quick
draft, based on your v44 patch:


 The jsonb data type supports array-style subscripting expressions
 to extract or update particular elements. It's possible to use multiple
 subscripting expressions to extract nested values. In this case, a chain of
 subscripting expressions follows the same rules as the
 path argument in the jsonb_set function,
 e.g. in the case of arrays it is a 0-based operation, and negative integers
 that appear in path count from the end of JSON arrays.
 The result of subscripting expressions is always of the jsonb data type.


 UPDATE statements may use subscripting in the 
 SET clause to modify jsonb values. Every
 affected value must conform to the path defined by the subscript(s). If the
 path cannot be followed to its end for any individual value (e.g.
 val['a']['b']['c'] where val['a'] or
 val['b'] is null, a string, or a number), an error is
 raised even if other values do conform.


 An example of subscripting syntax:




Re: WIP: System Versioned Temporal Table

2021-01-14 Thread Surafel Temesgen
Hi Andrew,
On Fri, Jan 8, 2021 at 4:38 PM Andrew Dunstan  wrote:

>
> On 1/8/21 7:33 AM, Simon Riggs wrote:
> >
> > * What happens if you ask for a future time?
> > It will give an inconsistent result as it scans, so we should refuse a
> > query for time > current_timestamp.
>
>
> That seems like a significant limitation. Can we fix it instead of
> refusing the query?
>
>

Querying a table without system versioning for a value of non-existent
data returns no records rather than erroring out or behaving otherwise. I
don't understand the need for special treatment here.

regards
Surafel


Re: psql \df choose functions by their arguments

2021-01-14 Thread Greg Sabino Mullane
Thanks for the feedback: new version v5 (attached) has int8, plus the
suggested code formatting.

Cheers,
Greg


v5-psql-df-pick-function-by-type.patch
Description: Binary data


Re: pg_preadv() and pg_pwritev()

2021-01-14 Thread Sergey Shinderuk

On 14.01.2021 18:42, Tom Lane wrote:

I noticed that "cc" invoked from command line uses:
-isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk


Hm, how did you determine that exactly?



% echo 'int main(void){}' >tmp.c
% cc -v tmp.c
Apple clang version 12.0.0 (clang-1200.0.32.28)
Target: x86_64-apple-darwin19.6.0
Thread model: posix
InstalledDir: 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin


"/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang" 
-cc1 -triple x86_64-apple-macosx10.15.0 -Wdeprecated-objc-isa-usage 
-Werror=deprecated-objc-isa-usage -Werror=implicit-function-declaration 
-emit-obj -mrelax-all -disable-free -disable-llvm-verifier 
-discard-value-names -main-file-name tmp.c -mrelocation-model pic 
-pic-level 2 -mthread-model posix -mframe-pointer=all -fno-strict-return 
-masm-verbose -munwind-tables -target-sdk-version=10.15.6 
-fcompatibility-qualified-id-block-type-checking -target-cpu penryn 
-dwarf-column-info -debugger-tuning=lldb -target-linker-version 609.8 -v 
-resource-dir 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/12.0.0 
-isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk 
-I/usr/local/include -internal-isystem 
/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk/usr/local/include 
-internal-isystem 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/12.0.0/include 
-internal-externc-isystem 
/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk/usr/include 
-internal-externc-isystem 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include 
-Wno-reorder-init-list -Wno-implicit-int-float-conversion 
-Wno-c99-designator -Wno-final-dtor-non-final-class -Wno-extra-semi-stmt 
-Wno-misleading-indentation -Wno-quoted-include-in-framework-header 
-Wno-implicit-fallthrough -Wno-enum-enum-conversion 
-Wno-enum-float-conversion -fdebug-compilation-dir /Users/shinderuk 
-ferror-limit 19 -fmessage-length 238 -stack-protector 1 -fstack-check 
-mdarwin-stkchk-strong-link -fblocks -fencode-extended-block-signature 
-fregister-global-dtors-with-atexit -fgnuc-version=4.2.1 
-fobjc-runtime=macosx-10.15.0 -fmax-type-align=16 
-fdiagnostics-show-option -fcolor-diagnostics -o 
/var/folders/8x/jvqv7hyd5h98m7tz2zm9r0ycgn/T/tmp-91fb5e.o -x c tmp.c
clang -cc1 version 12.0.0 (clang-1200.0.32.28) default target 
x86_64-apple-darwin19.6.0
ignoring nonexistent directory 
"/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk/usr/local/include"
ignoring nonexistent directory 
"/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk/Library/Frameworks"

#include "..." search starts here:
#include <...> search starts here:
 /usr/local/include

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/12.0.0/include
 /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk/usr/include

/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include

/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk/System/Library/Frameworks 
(framework directory)

End of search list.

"/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ld" 
-demangle -lto_library 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/libLTO.dylib 
-no_deduplicate -dynamic -arch x86_64 -platform_version macos 10.15.0 
10.15.6 -syslibroot 
/Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk -o a.out 
-L/usr/local/lib 
/var/folders/8x/jvqv7hyd5h98m7tz2zm9r0ycgn/T/tmp-91fb5e.o -lSystem 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/12.0.0/lib/darwin/libclang_rt.osx.a





NOT VALID for Unique Indexes

2021-01-14 Thread Simon Riggs
As you may be aware the NOT VALID qualifier currently only applies to
CHECK and FK constraints, but not yet to unique indexes. I have had
customer requests to change that.

It's a reasonably common requirement to be able to change an index
to/from a unique index, i.e. Unique -> NonUnique or NonUnique to
Unique. Previously, it was easy enough to do that using a catalog
update, but security concerns and the fact that the optimizer
uses the uniqueness to optimize queries mean that there is a gap in
our support. We obviously need to scan the index to see if it actually
can be marked as unique.

In terms of locking we need to exclude writes while we add uniqueness,
so scanning the index to check it is unique would cause problems. So
we need to do the same thing as we do with other constraint types: add
the constraint NOT VALID in one transaction and then later validate it
in a separate transaction (if ever).

I present a WIP patch to show it's a small patch to change Uniqueness
for an index, with docs and tests.

ALTER INDEX SET [NOT] UNIQUE [NOT VALID]
ALTER INDEX VALIDATE UNIQUE

It doesn't do the index validation scan (yet), but I wanted to check
acceptability, syntax and requirements before I do that.

I can also add similar syntax for UNIQUE and PK constraints.

Thoughts please?

-- 
Simon Riggshttp://www.EnterpriseDB.com/


alter_index_set_unique_not_valid.v4.patch
Description: Binary data


Re: Stronger safeguard for archive recovery not to miss data

2021-01-14 Thread Laurenz Albe
On Tue, 2020-12-08 at 03:08 +, osumi.takami...@fujitsu.com wrote:
> On Thursday, November 26, 2020 4:29 PM
> Kyotaro Horiguchi  wrote:
> > At Thu, 26 Nov 2020 07:18:39 +, "osumi.takami...@fujitsu.com"
> >  wrote in
> > > The attached patch is intended to prevent a scenario that archive
> > > recovery hits WALs which come from wal_level=minimal and the server
> > > continues to work, which was discussed in the thread of [1].
> > 
> > Perhaps we need the TAP test that conducts the above steps.
>
> I added the TAP tests to reproduce and share the result,
> using the case of 6-(1) described above.
> Here, I created a new file for it because the purposes of other files in
> src/recovery didn't match the purpose of my TAP tests perfectly.
> If you are dubious about this idea, please have a look at the comments
> in each file.
> 
> When the attached patch is applied,
> my TAP tests are executed like other ones like below.
> 
> t/018_wal_optimize.pl  ok
> t/019_replslot_limit.pl .. ok
> t/020_archive_status.pl .. ok
> t/021_row_visibility.pl .. ok
> t/022_archive_recovery.pl  ok
> All tests successful.
> 
> Also, I confirmed that there's no regression by make check-world.
> Any comments ?

The patch applies and passes regression tests, as well as the new TAP test.

I think this should be backpatched, since it fixes a bug.

I am not quite happy with the message:

FATAL:  WAL was generated with wal_level=minimal, data may be missing
HINT:  This happens if you temporarily set wal_level=minimal without taking a 
new base backup.

This sounds too harmless to me and doesn't give the user a clue
what would be the best way to proceed.

Suggestion:

FATAL:  WAL was generated with wal_level=minimal, cannot continue recovering
DETAIL:  This happens if you temporarily set wal_level=minimal on the primary 
server.
HINT:  Create a new standby from a new base backup after setting 
wal_level=replica.
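
For reference, that would look roughly like this at the call site (the
messages are the ones suggested above; the surrounding check in xlog.c is
assumed):

    ereport(FATAL,
            (errmsg("WAL was generated with wal_level=minimal, cannot continue recovering"),
             errdetail("This happens if you temporarily set wal_level=minimal on the primary server."),
             errhint("Create a new standby from a new base backup after setting wal_level=replica.")));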

Yours,
Laurenz Albe





Re: pg_preadv() and pg_pwritev()

2021-01-14 Thread Tom Lane
Sergey Shinderuk  writes:
> I noticed that "cc" invoked from command line uses:
> -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk

Hm, how did you determine that exactly?

regards, tom lane




Re: CheckpointLock needed in CreateCheckPoint()?

2021-01-14 Thread Robert Haas
On Thu, Jan 7, 2021 at 1:02 AM Amul Sul  wrote:
> As per this comment, it seems to be that we don't really need this LW lock. We
> could have something else instead if we are afraid of having multiple
> checkpoints at any given time which isn't possible, btw.

Yeah, I think this lock is useless. In fact, I think it's harmful,
because it makes large sections of the checkpointer code, basically
all of it really, run with interrupts held off for no reason.

It's not impossible that removing the lock could reveal bugs
elsewhere: suppose we have segments of code that actually aren't safe
to interrupt, but because of this LWLock, it never happens. But, if
that happens to be true, I think we should just view those cases as
bugs to be fixed.

One thing that blunts the impact of this quite a bit is that the
checkpointer doesn't rely much on CHECK_FOR_INTERRUPTS() in the
first place. It has its own machinery: HandleCheckpointerInterrupts().
Possibly that's something we should be looking to refactor somehow as
well.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: outdated references to replication timeout

2021-01-14 Thread John Naylor
On Thu, Jan 14, 2021 at 1:55 AM Michael Paquier  wrote:
>
> On Wed, Jan 13, 2021 at 11:28:55PM +0900, Fujii Masao wrote:
> > On Wed, Jan 13, 2021 at 10:51 PM John Naylor <
john.nay...@enterprisedb.com> wrote:
> >> It is strange, now that I think about it. My thinking was that the
> >> former wording of "replication timeout" was a less literal
> >> reference to the replication_timeout parameter, so I did the same
> >> for wal_sender_timeout. A quick look shows we are not consistent
> >> in the documentation as far as walsender vs. WAL sender. For the
> >> purpose of the patch I agree it should be consistent within a
> >> single message. Maybe the parameter should be spelled exactly as
> >> is, with underscores?
> >
> > I'm ok with that. But there seems no other timeout messages using
> > the parameter name.
>
> Could it be that nothing needs to change here because this refers to
> timeouts with the replication protocol?  The current state of things
> on HEAD does not sound completely incorrect to me either.

It was off enough to cause brief confusion during a customer emergency. To
show a bit more context around one of the proposed corrections:

 timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
 wal_sender_timeout);

if (wal_sender_timeout > 0 && last_processing >= timeout)
{
/*
* Since typically expiration of replication timeout means
* communication problem, we don't send the error message to the
* standby.
*/
ereport(COMMERROR,
(errmsg("terminating walsender process due to replication timeout")));

My reading is, this is a case where the message didn't keep up with the
change in parameter name.
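
In other words, something along these lines (exact wording still open, of
course):

    ereport(COMMERROR,
            (errmsg("terminating walsender process due to wal_sender_timeout")));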

--
John Naylor
EDB: http://www.enterprisedb.com


Re: POC: postgres_fdw insert batching

2021-01-14 Thread Tomas Vondra


On 1/14/21 2:57 PM, Amit Langote wrote:
> On Thu, Jan 14, 2021 at 21:57 Tomas Vondra  wrote:
> 
> On 1/14/21 9:58 AM, Amit Langote wrote:
> > Hi,
> >
> > On Thu, Jan 14, 2021 at 2:41 AM Tomas Vondra  wrote:
> >> On 1/13/21 3:43 PM, Tomas Vondra wrote:
> >>> Thanks for the report. Yeah, I think there's a missing check in
> >>> ExecInsert. Adding
> >>>
> >>>   (!resultRelInfo->ri_TrigDesc->trig_insert_after_row)
> >>>
> >>> solves this. But now I'm wondering if this is the wrong place to
> make
> >>> this decision. I mean, why should we make the decision here,
> when the
> >>> decision whether to have a RETURNING clause is made in
> postgres_fdw in
> >>> deparseReturningList? We don't really know what the other FDWs
> will do,
> >>> for example.
> >>>
> >>> So I think we should just move all of this into
> GetModifyBatchSize. We
> >>> can start with ri_BatchSize = 0. And then do
> >>>
> >>>   if (resultRelInfo->ri_BatchSize == 0)
> >>>     resultRelInfo->ri_BatchSize =
> >>>     
>  resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo);
> >>>
> >>>   if (resultRelInfo->ri_BatchSize > 1)
> >>>   {
> >>>     ... do batching ...
> >>>   }
> >>>
> >>> The GetModifyBatchSize would always return value > 0, so either
> 1 (no
> >>> batching) or >1 (batching).
> >>>
> >>
> >> FWIW the attached v8 patch does this - most of the conditions are
> moved
> >> to the GetModifyBatchSize() callback.
> >
> > Thanks.  A few comments:
> >
> > * I agree with leaving it up to an FDW to look at the properties of
> > the table and of the operation being performed to decide whether or
> > not to use batching, although maybe BeginForeignModify() is a better
> > place for putting that logic instead of GetModifyBatchSize()?  So, in
> > create_foreign_modify(), instead of PgFdwModifyState.batch_size simply
> > being set to match the table's or the server's value for the
> > batch_size option, make it also consider the things that prevent
> > batching and set the execution state's batch_size based on that.
> > GetModifyBatchSize() simply returns that value.
> >
> > * Regarding the timing of calling GetModifyBatchSize() to set
> > ri_BatchSize, I wonder if it wouldn't be better to call it just once,
> > say from ExecInitModifyTable(), right after BeginForeignModify()
> > returns?  I don't quite understand why it is being called from
> > ExecInsert().  Can the batch size change once the execution starts?
> >
> 
> But it should be called just once. The idea is that initially we have
> batch_size=0, and the fist call returns value that is >= 1. So we never
> call it again. But maybe it could be called from BeginForeignModify, in
> which case we'd not need this logic with first setting it to 0 etc.
> 
> 
> Right, although I was thinking that maybe ri_BatchSize itself is not to
> be written to by the FDW.  Not to say that’s doing anything wrong though.
> 
> > * Lastly, how about calling it GetForeignModifyBatchSize() to be
> > consistent with other nearby callbacks?
> >
> 
> Yeah, good point.
> 
> >> I've removed the check for the
> >> BatchInsert callback, though - the FDW knows whether it supports
> that,
> >> and it seems a bit pointless at the moment as there are no other
> batch
> >> callbacks. Maybe we should add an Assert somewhere, though?
> >
> > Hmm, not checking whether BatchInsert() exists may not be good idea,
> > because if an FDW's GetModifyBatchSize() returns a value > 1 but
> > there's no BatchInsert() function to call, ExecBatchInsert() would
> > trip.  I don't see the newly added documentation telling FDW authors
> > to either define both or none.
> >
> 
> Hmm. The BatchInsert check seemed somewhat unnecessary to me, but OTOH
> it can't hurt, I guess. I'll ad it back.
> 
> > Regarding how this plays with partitions, I don't think we need
> > ExecGetTouchedPartitions(), because you can get the routed-to
> > partitions using es_tuple_routing_result_relations.  Also, perhaps
> 
> I'm not very familiar with es_tuple_routing_result_relations, but that
> doesn't seem to work. I've replaced the flushing code at the end of
> ExecModifyTable with a loop over es_tuple_routing_result_relations, but
> then some of the rows are missing (i.e. not flushed).
> 
> 
> I should’ve mentioned es_opened_result_relations too which contain
> non-routing result relations.  So I really meant if (proute) then use
> es_tuple_routing_result_relations, else es_opened_result_relations. 
> This should work as long as batching is only used for inserts.
> 

Ah, right. That did the trick.

Attached 

Re: [HACKERS] [PATCH] Generic type subscripting

2021-01-14 Thread Dmitry Dolgov
> On Tue, Jan 12, 2021 at 08:02:59PM +0100, Pavel Stehule wrote:
> On Sun, Jan 10, 2021 at 19:52 Pavel Stehule  wrote:
>
> I tested behaviour and I didn't find anything other than the mentioned
> issue.
>
> Now I can check this feature from plpgsql, and it is working. Because there
> is no special support in the plpgsql runtime, an update of jsonb is
> significantly slower than an update of arrays, and it looks like the update
> of jsonb has O(N^2) cost. I don't think it is important at this moment - more
> important is the fact that I didn't find any memory problems.

Thanks for testing. Regarding updates when the structure doesn't match the
provided path: as I've mentioned, I don't have strong preferences, but on
second thought I'm probably more inclined to return an error in this case.
Since there are pros and cons for both suggestions, it could be decided by
majority vote between the no-update option (Dian) and the error option
(Pavel, me). Any +1 for one of the options from others?
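
To make the choice concrete, here is a minimal illustration of the case being
voted on (table and values are made up; the subscripting assignment syntax is
the one added by this patch):

    CREATE TABLE t (j jsonb);
    INSERT INTO t VALUES ('{"a": 1}');

    -- "a" is a scalar, so the path ['a']['b'] does not match the structure:
    UPDATE t SET j['a']['b'] = '42';
    -- option 1: leave the row unchanged (no update)
    -- option 2: raise an error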

Other than that, since I've already posted the patch for returning an
error option, it seems that the only thing left is to decide with which
version to go.




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-14 Thread Li Japin

> On Jan 14, 2021, at 8:44 PM, Amit Kapila  wrote:
> 
> On Thu, Jan 14, 2021 at 6:02 PM japin  wrote:
>> 
>> On Thu, 14 Jan 2021 at 20:19, Bharath Rupireddy wrote:
>>> On Thu, Jan 14, 2021 at 5:36 PM Li Japin  wrote
 Do we really need to access PUBLICATIONRELMAP in this patch? What if
 we just set it to false in the else condition of (if (publish &&
 (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot)))
 
 Thank for you review. I agree with you, it doesn’t need to access 
 PUBLICATIONRELMAP, since
 We already get the publication oid in GetRelationPublications(relid) [1], 
 which also access
 PUBLICATIONRELMAP.  If the current pub->oid does not in pubids, the 
 publish is false, so we
 do not need publish the table.
>>> 
>>> +1. This is enough.
>>> 
 I have another question, the data->publications is a list, when did it has 
 more then one items?
>>> 
>>> IIUC, when the single table is associated with multiple publications,
>>> then data->publications will have multiple entries. Though I have not
>>> tried, we can try having two or three publications for the same table
>>> and verify that.
>>> 
>> 
>> I try add one table into two publications, but the data->publications has 
>> only
>> one item.  Is there something I missed?
>> 
> 
> I think you will have multiple publications in that list when the
> subscriber has subscribed to multiple publications. For example,
> Create Subscription ... Publication pub1, Publication pub2.
> 

Thanks, you are right! When we create a subscription with multiple publications,
data->publications has more than one item.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.



Re: POC: postgres_fdw insert batching

2021-01-14 Thread Amit Langote
On Thu, Jan 14, 2021 at 21:57 Tomas Vondra 
wrote:

> On 1/14/21 9:58 AM, Amit Langote wrote:
> > Hi,
> >
> > On Thu, Jan 14, 2021 at 2:41 AM Tomas Vondra
> >  wrote:
> >> On 1/13/21 3:43 PM, Tomas Vondra wrote:
> >>> Thanks for the report. Yeah, I think there's a missing check in
> >>> ExecInsert. Adding
> >>>
> >>>   (!resultRelInfo->ri_TrigDesc->trig_insert_after_row)
> >>>
> >>> solves this. But now I'm wondering if this is the wrong place to make
> >>> this decision. I mean, why should we make the decision here, when the
> >>> decision whether to have a RETURNING clause is made in postgres_fdw in
> >>> deparseReturningList? We don't really know what the other FDWs will do,
> >>> for example.
> >>>
> >>> So I think we should just move all of this into GetModifyBatchSize. We
> >>> can start with ri_BatchSize = 0. And then do
> >>>
> >>>   if (resultRelInfo->ri_BatchSize == 0)
> >>> resultRelInfo->ri_BatchSize =
> >>>   resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo);
> >>>
> >>>   if (resultRelInfo->ri_BatchSize > 1)
> >>>   {
> >>> ... do batching ...
> >>>   }
> >>>
> >>> The GetModifyBatchSize would always return value > 0, so either 1 (no
> >>> batching) or >1 (batching).
> >>>
> >>
> >> FWIW the attached v8 patch does this - most of the conditions are moved
> >> to the GetModifyBatchSize() callback.
> >
> > Thanks.  A few comments:
> >
> > * I agree with leaving it up to an FDW to look at the properties of
> > the table and of the operation being performed to decide whether or
> > not to use batching, although maybe BeginForeignModify() is a better
> > place for putting that logic instead of GetModifyBatchSize()?  So, in
> > create_foreign_modify(), instead of PgFdwModifyState.batch_size simply
> > being set to match the table's or the server's value for the
> > batch_size option, make it also consider the things that prevent
> > batching and set the execution state's batch_size based on that.
> > GetModifyBatchSize() simply returns that value.
> >
> > * Regarding the timing of calling GetModifyBatchSize() to set
> > ri_BatchSize, I wonder if it wouldn't be better to call it just once,
> > say from ExecInitModifyTable(), right after BeginForeignModify()
> > returns?  I don't quite understand why it is being called from
> > ExecInsert().  Can the batch size change once the execution starts?
> >
>
> But it should be called just once. The idea is that initially we have
> batch_size=0, and the fist call returns value that is >= 1. So we never
> call it again. But maybe it could be called from BeginForeignModify, in
> which case we'd not need this logic with first setting it to 0 etc.


Right, although I was thinking that maybe ri_BatchSize itself is not to be
written to by the FDW.  Not to say that’s doing anything wrong though.

> > * Lastly, how about calling it GetForeignModifyBatchSize() to be
> > consistent with other nearby callbacks?
> >
>
> Yeah, good point.
>
> >> I've removed the check for the
> >> BatchInsert callback, though - the FDW knows whether it supports that,
> >> and it seems a bit pointless at the moment as there are no other batch
> >> callbacks. Maybe we should add an Assert somewhere, though?
> >
> > Hmm, not checking whether BatchInsert() exists may not be good idea,
> > because if an FDW's GetModifyBatchSize() returns a value > 1 but
> > there's no BatchInsert() function to call, ExecBatchInsert() would
> > trip.  I don't see the newly added documentation telling FDW authors
> > to either define both or none.
> >
>
> Hmm. The BatchInsert check seemed somewhat unnecessary to me, but OTOH
> it can't hurt, I guess. I'll ad it back.
>
> > Regarding how this plays with partitions, I don't think we need
> > ExecGetTouchedPartitions(), because you can get the routed-to
> > partitions using es_tuple_routing_result_relations.  Also, perhaps
>
> I'm not very familiar with es_tuple_routing_result_relations, but that
> doesn't seem to work. I've replaced the flushing code at the end of
> ExecModifyTable with a loop over es_tuple_routing_result_relations, but
> then some of the rows are missing (i.e. not flushed).


I should’ve mentioned es_opened_result_relations too which contain
non-routing result relations.  So I really meant if (proute) then use
es_tuple_routing_result_relations, else es_opened_result_relations.  This
should work as long as batching is only used for inserts.


> > it's a good idea to put the "finishing" ExecBatchInsert() calls into a
> > function ExecFinishBatchInsert().  Maybe the logic to choose the
> > relations to perform the finishing calls on will get complicated in
> > the future as batching is added for updates/deletes too and it seems
> > better to encapsulate that in the separate function than have it out
> > in the open in ExecModifyTable().
> >
>
> IMO that'd be an over-engineering at this point. We don't need such
> separate function yet, so why complicate the API? If we need it in the
> future, we can add it.


Fair enough.
-- 

cost_sort vs cost_agg

2021-01-14 Thread Andy Fan
Currently cost_sort doesn't consider the number of columns to sort, which
means the cost of SELECT * FROM t ORDER BY a; equals that of SELECT * FROM t
ORDER BY a, b; which is obviously wrong.  The impact shows up when we choose
the plan for SELECT DISTINCT * FROM t ORDER BY c between:

 Sort
   Sort Key: c
   ->  HashAggregate
 Group Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n

and

 Unique
   ->  Sort
 Sort Key: c, a, b, d, e, f, g, h, i, j, k, l, m, n


Since "Sort (c)" has the same cost as  "Sort (c, a, b, d, e, f, g, h, i, j,
k,
l, m, n)", and Unique node on a sorted input is usually cheaper than
HashAggregate, so the later one will win usually which might bad at many
places.

My patch v1 did a simple improvement for cost_sort, which will consider the
number of cols to sort. The main part is below:

cost_sort:
Assert(numSortCols);
/* Include the default cost-per-comparison */
+ comparison_cost += (2.0 * cpu_operator_cost * numSortCols);
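
For context, a simplified sketch of where that per-tuple comparison cost feeds
into cost_sort()'s estimate (variable names follow costsize.c; the external
sort branch is omitted):

	/* with the patch, the per-comparison cost scales with the sort columns */
	comparison_cost += 2.0 * cpu_operator_cost * numSortCols;

	/* in-memory quicksort: roughly tuples * log2(tuples) comparisons */
	startup_cost += comparison_cost * tuples * LOG2(tuples);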


However it still chooses a wrong plan in the simple case below.

create table wcols (a int, b int, c int, d int, e int, f int, g int, h int,
                    i int, j int, k int, l int, m int, n int);

insert into wcols select i, i, i, i, i, i, i, i, i, i, i, i, i, i
from generate_series(1, 100) i;

select distinct * from wcols order by c;


The optimizer chose HashAggregate with my patch, but it takes 6s. After setting
enable_hashagg = off, it takes 2s.


The main reason is that neither cost_sort nor cost_agg considers the real hash
function or real sort function; they use cpu_operator_cost instead. If we
really want to fix this issue, shall we a) figure out the real pg_proc.oid for
the sort and hash functions during the planning stage and cost with that?
b) in cost_sort, also consider the natural order of the input data for the
ordered column? c) store the OIDs in SortPath and AggPath to avoid the double
calculation during the createplan stage? Or is there any better suggestion?

Thanks

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


v1-0001-Improve-the-cost_sort-v1.patch
Description: Binary data


Re: ResourceOwner refactoring

2021-01-14 Thread Heikki Linnakangas

On 14/01/2021 12:15, kuroda.hay...@fujitsu.com wrote:

I put some comments.


Thanks for the review!


Throughout, some components don’t have helper functions.
(e.g. catcache has ResOwnerReleaseCatCache, but tupdesc doesn't.)
I think it should be unified.


Hmm. ResOwnerReleaseTupleDesc() does exist, those functions are needed 
for the callbacks. I think you meant the wrappers around 
ResourceOwnerRemember and ResourceOwnerForget, like 
ResourceOwnerRememberCatCacheRef(). I admit those are not fully 
consistent: I didn't bother with the wrapper functions when there is 
only one caller.



[catcache.c]

+/* support for catcache refcount management */
+static inline void
+ResourceOwnerEnlargeCatCacheRefs(ResourceOwner owner)
+{
+   ResourceOwnerEnlarge(owner);
+}


This function is not needed.


Removed.


[syscache.c]

-static CatCache *SysCache[SysCacheSize];
+ CatCache *SysCache[SysCacheSize];


Is it right? Compilation is done even if this variable is static...


Fixed. (It was a leftover from when I was playing with Kyotaro's 
"catcachebench" tool from another thread).



[fd.c, dsm.c]
In these files helper functions are implemented as the define directive.
Could you explain the reason? For the performance?


No particular reason. I turned them all into macros for consistency.


Previously, this was OK, because resources AAA and BBB were kept in
separate arrays. But after this patch, it's not guaranteed that the
ResourceOwnerRememberBBB() will find an empty slot.

I don't think we do that currently, but to be sure, I'm planning to grep
ResourceOwnerRemember and look at each call carefullly. And perhaps we
can add an assertion for this, although I'm not sure where.


Indeed, but I think this line works well, isn't it?


Assert(owner->narr < RESOWNER_ARRAY_SIZE);


That catches cases where you actually overrun the array, but it doesn't 
catch unsafe patterns when there happens to be enough space left in the 
array. For example, if you have code like this:


/* Make sure there's room for one more entry, but remember *two* things */
ResourceOwnerEnlarge();
ResourceOwnerRemember(foo);
ResourceOwnerRemember(bar);

That is not safe, but it would only fail the assertion if the first 
ResourceOwnerRemember() call happens to consume the last remaining slot 
in the array.
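
For comparison, a minimal sketch of the safe pattern, written in the same
shorthand as the example above: reserve space immediately before each
Remember call.

/* Safe: each Remember call has its own reservation right before it */
ResourceOwnerEnlarge();
ResourceOwnerRemember(foo);

ResourceOwnerEnlarge();
ResourceOwnerRemember(bar);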



I checked all the callers of ResourceOwnerEnlarge() to see if they're 
safe. A couple of places seemed a bit suspicious. I fixed them by moving 
the ResourceOwnerEnlarge() calls closer to the ResourceOwnerRemember() 
calls, so that it's now easier to see that they are correct. See first 
attached patch.


The second patch is an updated version of the main patch, fixing all the 
little things you and Michael pointed out since the last patch version.


I've been working on performance testing too. I'll post more numbers 
later, but preliminary result from some micro-benchmarking suggests that 
the new code is somewhat slower, except in the common case that the 
object to remember and forget fits in the array. When running the 
regression test suite, about 96% of ResourceOwnerForget() calls fit in 
the array. I think that's acceptable.


- Heikki
>From 915ef70c8e227fe1c9d403bf6414d54b6673ddae Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 13 Jan 2021 12:21:28 +0200
Subject: [PATCH v4 1/2] Move a few ResourceOwnerEnlarge() calls for safety and
 clarity.

These are functions where quite a lot of things happen between the
ResourceOwnerEnlarge and ResourceOwnerRemember calls. It's important that
there are no unrelated ResourceOwnerRemember() calls in the code
inbetween, otherwise the entry reserved by the ResourceOwnerEnlarge() call
might be used up by the intervening ResourceOwnerRemember() and not be
available at the intended ResourceOwnerRemember() call anymore. The longer
the code path between them is, the harder it is to verify that.

In bufmgr.c, there is a function similar to ResourceOwnerEnlarge(),
to ensure that the private refcount array has enough space. The
ReservePrivateRefCountEntry() calls, analogous to ResourceOwnerEnlarge(),
were made at different places than the ResourceOwnerEnlarge() calls.
Move the ResourceOwnerEnlarge() calls together with the
ReservePrivateRefCountEntry() calls for consistency.
---
 src/backend/storage/buffer/bufmgr.c   | 39 +++
 src/backend/storage/buffer/localbuf.c |  3 +++
 src/backend/utils/cache/catcache.c| 13 ++---
 3 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092f..cde869e7d64 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -738,9 +738,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	*hit = false;
 
-	/* Make sure we will have room to remember the buffer pin */
-	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
-
 	isExtend = (blockNum == P_NEW);
 
 	

Re: Is Recovery actually paused?

2021-01-14 Thread Yugo NAGATA
On Wed, 13 Jan 2021 17:49:43 +0530
Dilip Kumar  wrote:

> On Wed, Jan 13, 2021 at 3:35 PM Dilip Kumar  wrote:
> >
> > On Wed, Jan 13, 2021 at 3:27 PM Yugo NAGATA  wrote:
> > >
> > > On Thu, 10 Dec 2020 11:25:23 +0530
> > > Dilip Kumar  wrote:
> > >
> > > > > > However, I wonder users don't expect pg_is_wal_replay_paused to 
> > > > > > wait.
> > > > > > Especially, if max_standby_streaming_delay is -1, this will be 
> > > > > > blocked forever,
> > > > > > although this setting may not be usual. In addition, some users may 
> > > > > > set
> > > > > > recovery_min_apply_delay for a large.  If such users call 
> > > > > > pg_is_wal_replay_paused,
> > > > > > it could wait for a long time.
> > > > > >
> > > > > > At least, I think we need some descriptions on document to explain
> > > > > > pg_is_wal_replay_paused could wait while a time.
> > > > >
> > > > > Ok
> > > >
> > > > Fixed this, added some comments in .sgml as well as in function header
> > >
> > > Thank you for fixing this.
> > >
> > > Also, is it better to fix the description of pg_wal_replay_pause from
> > > "Pauses recovery." to "Request to pause recovery." in according with
> > > pg_is_wal_replay_paused?
> >
> > Okay
> >
> > >
> > > > > > Also, how about adding a new boolean argument to 
> > > > > > pg_is_wal_replay_paused to
> > > > > > control whether this waits for recovery to get paused or not? By 
> > > > > > setting its
> > > > > > default value to true or false, users can use the old format for 
> > > > > > calling this
> > > > > > and the backward compatibility can be maintained.
> > > > >
> > > > > So basically, if the wait_recovery_pause flag is false then we will
> > > > > immediately return true if the pause is requested?  I agree that it is
> > > > > good to have an API to know whether the recovery pause is requested or
> > > > > not but I am not sure is it good idea to make this API serve both the
> > > > > purpose?  Anyone else have any thoughts on this?
> > > > >
> > >
> > > I think the current pg_is_wal_replay_paused() already has another purpose;
> > > this waits recovery to actually get paused. If we want to limit this API's
> > > purpose only to return the pause state, it seems better to fix this to 
> > > return
> > > the actual state at the cost of lacking the backward compatibility. If we 
> > > want
> > > to know whether pause is requested, we may add a new API like
> > > pg_is_wal_replay_paluse_requeseted(). Also, if we want to wait recovery 
> > > to actually
> > > get paused, we may add an option to pg_wal_replay_pause() for this 
> > > purpose.
> > >
> > > However, this might be a bikeshedding. If anyone don't care that
> > > pg_is_wal_replay_paused() can make user wait for a long time, I don't 
> > > care either.
> >
> > I don't think that it will be blocked ever, because
> > pg_wal_replay_pause is sending the WakeupRecovery() which means the
> > recovery process will not be stuck on waiting for the WAL.

Yes, it does not get stuck waiting for WAL. However, it can get stuck while
resolving a recovery conflict. The process could wait for at most
max_standby_streaming_delay or max_standby_archive_delay before recovery gets
completely paused.

Also, it could wait for recovery_min_apply_delay if that has a valid value. It
is possible that a user sets this parameter to a large value, so it could wait
for a long time. However, this can be avoided by calling recoveryPausesHere()
or CheckAndSetRecoveryPause() in recoveryApplyDelay().

> > > > > > As another comment, while pg_is_wal_replay_paused is blocking, I 
> > > > > > can not cancel
> > > > > > the query. I think CHECK_FOR_INTERRUPTS() is necessary in the 
> > > > > > waiting loop.
> > >
> > > How about this fix? I think users may want to cancel 
> > > pg_is_wal_replay_paused() during
> > > this is blocking.
> >
> > Yeah, we can do this.  I will send the updated patch after putting
> > some more thought into these comments.  Thanks again for the feedback.
> >
> 
> Please find the updated patch.

Thanks.  I confirmed that I can cancel pg_is_wal_replay_paused() while it is stuck.
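
For reference, the cancellable wait being discussed has roughly this shape
(illustrative only, not the patch itself; the Recovery* function names are the
ones used by the patch):

	while (RecoveryPauseRequested() && !RecoveryIsPaused())
	{
		/* now interruptible, so the call can be cancelled */
		CHECK_FOR_INTERRUPTS();
		pg_usleep(10000L);	/* 10ms */
	}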


Although it is a very trivial comment, I think that the new line before
HandleStartupProcInterrupts() is unnecessary.

@@ -6052,12 +6062,20 @@ recoveryPausesHere(bool endOfRecovery)
 				(errmsg("recovery has paused"),
 				 errhint("Execute pg_wal_replay_resume() to continue.")));
 
-		while (RecoveryIsPaused())
+		while (RecoveryPauseRequested())
 		{
+
 			HandleStartupProcInterrupts();


Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: POC: postgres_fdw insert batching

2021-01-14 Thread Tomas Vondra
On 1/14/21 9:58 AM, Amit Langote wrote:
> Hi,
> 
> On Thu, Jan 14, 2021 at 2:41 AM Tomas Vondra
>  wrote:
>> On 1/13/21 3:43 PM, Tomas Vondra wrote:
>>> Thanks for the report. Yeah, I think there's a missing check in
>>> ExecInsert. Adding
>>>
>>>   (!resultRelInfo->ri_TrigDesc->trig_insert_after_row)
>>>
>>> solves this. But now I'm wondering if this is the wrong place to make
>>> this decision. I mean, why should we make the decision here, when the
>>> decision whether to have a RETURNING clause is made in postgres_fdw in
>>> deparseReturningList? We don't really know what the other FDWs will do,
>>> for example.
>>>
>>> So I think we should just move all of this into GetModifyBatchSize. We
>>> can start with ri_BatchSize = 0. And then do
>>>
>>>   if (resultRelInfo->ri_BatchSize == 0)
>>> resultRelInfo->ri_BatchSize =
>>>   resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo);
>>>
>>>   if (resultRelInfo->ri_BatchSize > 1)
>>>   {
>>> ... do batching ...
>>>   }
>>>
>>> The GetModifyBatchSize would always return value > 0, so either 1 (no
>>> batching) or >1 (batching).
>>>
>>
>> FWIW the attached v8 patch does this - most of the conditions are moved
>> to the GetModifyBatchSize() callback.
> 
> Thanks.  A few comments:
> 
> * I agree with leaving it up to an FDW to look at the properties of
> the table and of the operation being performed to decide whether or
> not to use batching, although maybe BeginForeignModify() is a better
> place for putting that logic instead of GetModifyBatchSize()?  So, in
> create_foreign_modify(), instead of PgFdwModifyState.batch_size simply
> being set to match the table's or the server's value for the
> batch_size option, make it also consider the things that prevent
> batching and set the execution state's batch_size based on that.
> GetModifyBatchSize() simply returns that value.
> 
> * Regarding the timing of calling GetModifyBatchSize() to set
> ri_BatchSize, I wonder if it wouldn't be better to call it just once,
> say from ExecInitModifyTable(), right after BeginForeignModify()
> returns?  I don't quite understand why it is being called from
> ExecInsert().  Can the batch size change once the execution starts?
> 

But it should be called just once. The idea is that initially we have
batch_size=0, and the first call returns a value that is >= 1. So we never
call it again. But maybe it could be called from BeginForeignModify, in
which case we'd not need this logic of first setting it to 0, etc.
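
To make that concrete, here is a hypothetical sketch of what such a callback
could look like on the postgres_fdw side (the name and the exact set of
conditions are still being discussed in this thread, so treat this as an
illustration, not the patch):

static int
postgresGetForeignModifyBatchSize(ResultRelInfo *resultRelInfo)
{
	PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState;

	/* RETURNING or AFTER ROW triggers force row-at-a-time execution */
	if (resultRelInfo->ri_projectReturning != NULL ||
		(resultRelInfo->ri_TrigDesc &&
		 resultRelInfo->ri_TrigDesc->trig_insert_after_row))
		return 1;

	/* otherwise use the batch_size taken from the table/server options */
	return fmstate ? fmstate->batch_size : 1;
}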

> * Lastly, how about calling it GetForeignModifyBatchSize() to be
> consistent with other nearby callbacks?
> 

Yeah, good point.

>> I've removed the check for the
>> BatchInsert callback, though - the FDW knows whether it supports that,
>> and it seems a bit pointless at the moment as there are no other batch
>> callbacks. Maybe we should add an Assert somewhere, though?
> 
> Hmm, not checking whether BatchInsert() exists may not be good idea,
> because if an FDW's GetModifyBatchSize() returns a value > 1 but
> there's no BatchInsert() function to call, ExecBatchInsert() would
> trip.  I don't see the newly added documentation telling FDW authors
> to either define both or none.
> 

Hmm. The BatchInsert check seemed somewhat unnecessary to me, but OTOH
it can't hurt, I guess. I'll add it back.

> Regarding how this plays with partitions, I don't think we need
> ExecGetTouchedPartitions(), because you can get the routed-to
> partitions using es_tuple_routing_result_relations.  Also, perhaps

I'm not very familiar with es_tuple_routing_result_relations, but that
doesn't seem to work. I've replaced the flushing code at the end of
ExecModifyTable with a loop over es_tuple_routing_result_relations, but
then some of the rows are missing (i.e. not flushed).

> it's a good idea to put the "finishing" ExecBatchInsert() calls into a
> function ExecFinishBatchInsert().  Maybe the logic to choose the
> relations to perform the finishing calls on will get complicated in
> the future as batching is added for updates/deletes too and it seems
> better to encapsulate that in the separate function than have it out
> in the open in ExecModifyTable().
> 

IMO that'd be an over-engineering at this point. We don't need such
separate function yet, so why complicate the API? If we need it in the
future, we can add it.

> (Sorry about being so late reviewing this.)

thanks

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: profiling

2021-01-14 Thread Heikki Linnakangas

On 14/01/2021 12:14, Павел Еремин wrote:

Hello. I am interested in the question of profiling. As far as I
understand, this issue has not been resolved in postgres. And I see
the urgency of this problem. Are there any postgres development plans
for this functionality?


What exactly do you mean by profiling? You can profile internal C 
functions with 'perf' or 'gprof' or similar. For profiling applications 
using PostgreSQL, pg_stat_statement() is very helpful. And there is a 
plugin for profiling PL/pgSQL functions, IIRC pgAdmin has a GUI for that.


- Heikki




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-14 Thread Amit Kapila
On Thu, Jan 14, 2021 at 6:02 PM japin  wrote:
>
> On Thu, 14 Jan 2021 at 20:19, Bharath Rupireddy wrote:
> > On Thu, Jan 14, 2021 at 5:36 PM Li Japin  wrote
> >> Do we really need to access PUBLICATIONRELMAP in this patch? What if
> >> we just set it to false in the else condition of (if (publish &&
> >> (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot)))
> >>
> >> Thank for you review. I agree with you, it doesn’t need to access 
> >> PUBLICATIONRELMAP, since
> >> We already get the publication oid in GetRelationPublications(relid) [1], 
> >> which also access
> >> PUBLICATIONRELMAP.  If the current pub->oid does not in pubids, the 
> >> publish is false, so we
> >> do not need publish the table.
> >
> > +1. This is enough.
> >
> >> I have another question, the data->publications is a list, when did it has 
> >> more then one items?
> >
> > IIUC, when the single table is associated with multiple publications,
> > then data->publications will have multiple entries. Though I have not
> > tried, we can try having two or three publications for the same table
> > and verify that.
> >
>
> I try add one table into two publications, but the data->publications has only
> one item.  Is there something I missed?
>

I think you will have multiple publications in that list when the
subscriber has subscribed to multiple publications. For example,
Create Subscription ... Publication pub1, Publication pub2.
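
As an illustration only, with made-up object names, that setup would be:

-- publisher
CREATE PUBLICATION pub1 FOR TABLE t1;
CREATE PUBLICATION pub2 FOR TABLE t1;

-- subscriber: subscribing to both makes data->publications a two-element list
CREATE SUBSCRIPTION sub1
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub1, pub2;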

-- 
With Regards,
Amit Kapila.




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-14 Thread japin


On Thu, 14 Jan 2021 at 20:19, Bharath Rupireddy wrote:
> On Thu, Jan 14, 2021 at 5:36 PM Li Japin  wrote
>> Do we really need to access PUBLICATIONRELMAP in this patch? What if
>> we just set it to false in the else condition of (if (publish &&
>> (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot)))
>>
>> Thank for you review. I agree with you, it doesn’t need to access 
>> PUBLICATIONRELMAP, since
>> We already get the publication oid in GetRelationPublications(relid) [1], 
>> which also access
>> PUBLICATIONRELMAP.  If the current pub->oid does not in pubids, the publish 
>> is false, so we
>> do not need publish the table.
>
> +1. This is enough.
>
>> I have another question, the data->publications is a list, when did it has 
>> more then one items?
>
> IIUC, when the single table is associated with multiple publications,
> then data->publications will have multiple entries. Though I have not
> tried, we can try having two or three publications for the same table
> and verify that.
>

I try add one table into two publications, but the data->publications has only
one item.  Is there something I missed?

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-14 Thread Bharath Rupireddy
On Thu, Jan 14, 2021 at 5:36 PM Li Japin  wrote
> Do we really need to access PUBLICATIONRELMAP in this patch? What if
> we just set it to false in the else condition of (if (publish &&
> (relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot)))
>
> Thank for you review. I agree with you, it doesn’t need to access 
> PUBLICATIONRELMAP, since
> We already get the publication oid in GetRelationPublications(relid) [1], 
> which also access
> PUBLICATIONRELMAP.  If the current pub->oid does not in pubids, the publish 
> is false, so we
> do not need publish the table.

+1. This is enough.

> I have another question, the data->publications is a list, when did it has 
> more then one items?

IIUC, when the single table is associated with multiple publications,
then data->publications will have multiple entries. Though I have not
tried, we can try having two or three publications for the same table
and verify that.

> 0001 - change as you suggest.
> 0002 - does not change.

Thanks for the updated patch. I will respond to Amit's previous
comment on the 0002 patch soon.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




Re: Fix typo about WalSndPrepareWrite

2021-01-14 Thread japin


On Thu, 14 Jan 2021 at 15:32, Kyotaro Horiguchi wrote:
> At Thu, 14 Jan 2021 06:46:35 +, Li Japin  wrote in 
>> 
>> On Jan 14, 2021, at 12:56 PM, Ashutosh Bapat 
>> <ashutosh.bapat@gmail.com> wrote:
>> 
>> Hi Japin,
>> Thanks for the report.
>> 
>> I think that comment is correct. It refers to the following code
>> blocks of XLogSendPhysical()
>> 
>> 2744 /*
>> 2745  * OK to read and send the slice.
>> 2746  */
>> 2747 resetStringInfo(&output_message);
>> 2748 pq_sendbyte(&output_message, 'w');
>> 2749
>> 2750 pq_sendint64(&output_message, startptr);    /* dataStart */
>> 2751 pq_sendint64(&output_message, SendRqstPtr); /* walEnd */
>> 2752 pq_sendint64(&output_message, 0);           /* sendtime, filled in last */
>> 
>> 2803  * Fill the send timestamp last, so that it is taken as late
>>       as possible.
>> 2804  */
>> 2805 resetStringInfo(&tmpbuf);
>> 2806 pq_sendint64(&tmpbuf, GetCurrentTimestamp());
>> 2807 memcpy(&output_message.data[1 + sizeof(int64) + sizeof(int64)],
>> 2808        tmpbuf.data, sizeof(int64));
>> 2809
>> 2810 pq_putmessage_noblock('d', output_message.data, output_message.len);
>> 
>> 
>> After a quick search, I found that WalSndPrepareWrite and WalSndWriteData 
>> are always pairs [1].
>> IIUC the space of sendtime leave by WalSndPrepareWrite, it always fill out 
>> by WalSndWriteData.
>> 
>> 
>> WalSndWriteData() also fills the timestamp there but it may not always
>> be used with WalSndPrepareWrite, at least theoretically. So it's the
>> XLogSendPhysical() that it's referring to.
>
> The two functions are the body of two logical-decoding API
> functions. They are assumed to be called in that order. See
> OutputPluginWrite() for the restriction. The sequence of the two
> logica-decoding funcitons and the code block in XLogSendPhysical are
> parallels in (theoretically) different protocols.
>

Does that mean the sendtime left by WalSndPrepareWrite is always filled out by
WalSndWriteData?  If so, I think we should modify the comment in
WalSndPrepareWrite.

-- 
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.




Re: Logical Replication - behavior of ALTER PUBLICATION .. DROP TABLE and ALTER SUBSCRIPTION .. REFRESH PUBLICATION

2021-01-14 Thread Li Japin

On Jan 14, 2021, at 1:25 PM, Amit Kapila 
<amit.kapil...@gmail.com> wrote:

On Wed, Jan 13, 2021 at 5:40 PM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com>
 wrote:

On Wed, Jan 13, 2021 at 3:16 PM Bharath Rupireddy
<bharath.rupireddyforpostg...@gmail.com>
 wrote:

On Wed, Jan 13, 2021 at 2:36 PM japin 
<japi...@hotmail.com> wrote:
In summary, I feel we need to fix the publisher sending the inserts
even though the table is dropped from the publication, that is the
patch patch proposed by japin. This solves the bug reported in this
thread.

And also, it's good to have the LogicalRepRelMap invalidation fix as a
0002 patch in invalidate_syncing_table_states, subscription_change_cb
and logicalrep_rel_open as proposed by me.

Thoughts?


I think invalidate the LogicalRepRelMap is necessary.  If the table isn't in
subscription, can we remove the LogicalRepRelMapEntry from LogicalRepRelMap?

IIUC, it's not possible to know the relid of the alter
publication...dropped table in the invalidation callbacks
invalidate_syncing_table_states or subscription_change_cb, so it's
better to set state of all the entries to SUBREL_STATE_UNKNOWN, so
that, in the next logicalrep_rel_open call the function
GetSubscriptionRelState will be called and the pg_subscription_rel
will be read freshly.

This is inline with the reasoning specified at [1].

[1] - 
https://www.postgresql.org/message-id/CAFiTN-udwuc_h0TaFrSEKb-Yo1vVvkjz9TDRw7VE3P-KiPSGbQ%40mail.gmail.com

Attaching following two patches:

0001 - Makes publisher to not publish the changes for the alter
publication ... dropped tables. Original patch is by japin, I added
comments, changed the function name and adjusted the code a bit.


Do we really need to access PUBLICATIONRELMAP in this patch? What if
we just set it to false in the else condition of (if (publish &&
(relkind != RELKIND_PARTITIONED_TABLE || pub->pubviaroot)))


Thanks for your review. I agree with you; it doesn't need to access
PUBLICATIONRELMAP, since we already get the publication OID in
GetRelationPublications(relid) [1], which also accesses PUBLICATIONRELMAP.
If the current pub->oid is not in pubids, publish is false, so we do not need
to publish the table.

I have another question: data->publications is a list; when does it have more
than one item?

0001 - change as you suggest.
0002 - does not change.

Please consider v2 for further review.

[1]
List *
GetRelationPublications(Oid relid)
{
List   *result = NIL;
CatCList   *pubrellist;
int i;

/* Find all publications associated with the relation. */
pubrellist = SearchSysCacheList1(PUBLICATIONRELMAP,
 ObjectIdGetDatum(relid));
for (i = 0; i < pubrellist->n_members; i++)
{
HeapTuple   tup = &pubrellist->members[i]->tuple;
Oid pubid = ((Form_pg_publication_rel) GETSTRUCT(tup))->prpubid;

result = lappend_oid(result, pubid);
}

ReleaseSysCacheList(pubrellist);

return result;
}


--
Regards,
Japin Li.
ChengDu WenWu Information Technology Co.,Ltd.



v2-0001-Fix-ALTER-PUBLICATION.DROP-TABLE-behaviour.patch
Description: v2-0001-Fix-ALTER-PUBLICATION.DROP-TABLE-behaviour.patch


v2-0002-Invalidate-relation-map-cache-in-subscriber-sysca.patch
Description:  v2-0002-Invalidate-relation-map-cache-in-subscriber-sysca.patch


Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-14 Thread Bharath Rupireddy
On Thu, Jan 14, 2021 at 3:52 PM Fujii Masao  wrote:
> -   if (!HeapTupleIsValid(tup))
> -   elog(ERROR, "cache lookup failed for user mapping %u", 
> entry->key);
> -   umform = (Form_pg_user_mapping) GETSTRUCT(tup);
> -   server = GetForeignServer(umform->umserver);
> -   ReleaseSysCache(tup);
> +   server = GetForeignServer(entry->serverid);
>
> What about applying only the change about serverid, as a separate patch at
> first? This change itself is helpful to get rid of error "cache lookup failed"
> in pgfdw_reject_incomplete_xact_state_change(). Patch attached.

Right, we can get rid of the "cache lookup failed for user mapping"
error and also storing server oid in the cache entry is helpful for
the new functions we are going to introduce.

serverid_v1.patch looks good to me. Both make check and make
check-world passes on my system.

I will respond to other comments soon.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com




profiling

2021-01-14 Thread Павел Еремин
Hello. I am interested in the question of profiling. As far as I
understand, this issue has not been resolved in postgres. And I see
the urgency of this problem. Are there any postgres development plans
for this functionality?

regards, Eremin Pavel.




Re: proposal: schema variables

2021-01-14 Thread Josef Šimánek
I did some testing locally. All runs smoothly, compiled without warning.

Later on (once merged) it would be nice to write down a documentation
page for the whole feature as a set next to documented individual
commands.
It took me a few moments to understand how this works.

I was looking for how to create a variable with a non-default value in one
command, but if I understand it correctly, that's not the purpose. The
variable lives in a schema with a default value, but you can set it per
session via LET.

Thus the "CREATE VARIABLE" statement should not usually be part of "classic"
queries; it should be treated more like TABLE or other DDL statements.

On the other hand, LET is there to use the variable in "classic" queries.
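
A minimal sketch of that workflow, with made-up names and the syntax proposed
by the patch:

-- DDL, done once, like creating a table
CREATE VARIABLE public.current_tenant AS integer DEFAULT 0;

-- per-session value
LET current_tenant = 42;

-- read it in a "classic" query
SELECT * FROM orders WHERE tenant_id = current_tenant;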

Does that make sense? Feel free to ping me if any help with documentation is
needed. I can try to prepare an initial variables guide once I make sure I
understand this feature well.

PS: Now it is clear to me why it is called "schema variables", not
"session variables".

On Thu, Jan 14, 2021 at 7:36 AM Pavel Stehule  wrote:
>
> Hi
>
> rebase
>
> Regards
>
> Pavel
>




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2021-01-14 Thread Fujii Masao



On 2021/01/09 10:12, Bharath Rupireddy wrote:

On Fri, Jan 8, 2021 at 9:55 AM Bharath Rupireddy
 wrote:

I will make the changes and post a new patch set soon.


Attaching v9 patch set that has addressed the review comments on the
disconnect function returning setof records, documentation changes,
and postgres_fdw--1.0-1.1.sql changes.

Please consider the v9 patch set for further review.


Thanks for updating the patch! I reviewed only 0001 patch.

+   /*
+* Quick exit if the cache has been destroyed in
+* disconnect_cached_connections.
+*/
+   if (!ConnectionHash)
+   return;

This code is not necessary at least in pgfdw_xact_callback() and
pgfdw_subxact_callback()? Because those functions check
"if (!xact_got_connection)" before checking the above condition.

-   if (!HeapTupleIsValid(tup))
-   elog(ERROR, "cache lookup failed for user mapping %u", 
entry->key);
-   umform = (Form_pg_user_mapping) GETSTRUCT(tup);
-   server = GetForeignServer(umform->umserver);
-   ReleaseSysCache(tup);
+   server = GetForeignServer(entry->serverid);

What about applying only the change about serverid, as a separate patch at
first? This change itself is helpful to get rid of error "cache lookup failed"
in pgfdw_reject_incomplete_xact_state_change(). Patch attached.

+   server = GetForeignServerExtended(entry->serverid, true);

Since the type of second argument in GetForeignServerExtended() is bits16,
it's invalid to specify "true" there?

+   if (no_server_conn_cnt > 0)
+   {
+   ereport(WARNING,
+   (errmsg_plural("found an active connection for which 
the foreign server would have been dropped",
+  "found some active 
connections for which the foreign servers would have been dropped",
+  no_server_conn_cnt),
+no_server_conn_cnt > 1 ?
+errdetail("Such connections are discarded at the 
end of remote transaction.")
+: errdetail("Such connection is discarded at the 
end of remote transaction.")));

At least for me, I'd prefer returning such connections with "NULL" in the
server_name column and "false" in the valid column, rather than emitting a
warning, because that would enable us to count the number of actual foreign
connections easily using SQL, for example.

+* During the first call, we initialize the function context, get the 
list
+* of active connections using get_connections and store this in the
+* function's memory context so that it can live multiple calls.
+*/
+   if (SRF_IS_FIRSTCALL())

I guess that you used value-per-call mode to make the function return
a set result since you referred to dblink_get_pkey(). But isn't it better to
use materialize mode like dblink_get_notify() does rather than
value-per-call because this function returns not so many records? ISTM
that we can simplify postgres_fdw_get_connections() by using materialize mode.
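
For reference, a rough skeleton of the materialize-mode approach (the
row-building part is elided and the details are only an illustration, not the
patch):

Datum
postgres_fdw_get_connections(PG_FUNCTION_ARGS)
{
	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
	TupleDesc	tupdesc;
	Tuplestorestate *tupstore;
	MemoryContext oldcontext;

	/* build the tuplestore in the per-query memory context */
	oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
		elog(ERROR, "return type must be a row type");
	tupstore = tuplestore_begin_heap(true, false, work_mem);
	rsinfo->returnMode = SFRM_Materialize;
	rsinfo->setResult = tupstore;
	rsinfo->setDesc = tupdesc;
	MemoryContextSwitchTo(oldcontext);

	/* ... walk ConnectionHash and tuplestore_putvalues() one row per entry ... */

	return (Datum) 0;
}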

+   hash_destroy(ConnectionHash);
+   ConnectionHash = NULL;

If GetConnection() is called after ConnectionHash is destroyed,
it initializes the hashtable and registers some callback functions again
even though the same functions have already been registered. This causes the
same function to be registered as a callback more than once. This is
a bug.

+CREATE FUNCTION postgres_fdw_disconnect ()

Do we really want postgres_fdw_disconnect() with no argument?
IMO postgres_fdw_disconnect() with the server name specified is enough.
But I'd like to hear the opinion about that.


Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/contrib/postgres_fdw/connection.c 
b/contrib/postgres_fdw/connection.c
index 266f66cc62..eaedfea9f2 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -57,6 +57,7 @@ typedef struct ConnCacheEntry
boolhave_error; /* have any subxacts aborted in 
this xact? */
boolchanging_xact_state;/* xact state change in process 
*/
boolinvalidated;/* true if reconnect is pending */
+   Oid serverid;   /* foreign server OID 
used to get server name */
uint32  server_hashvalue;   /* hash value of foreign server 
OID */
uint32  mapping_hashvalue;  /* hash value of user mapping 
OID */
 } ConnCacheEntry;
@@ -273,6 +274,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping 
*user)
entry->have_error = false;
entry->changing_xact_state = false;
entry->invalidated = false;
+   entry->serverid = server->serverid;
entry->server_hashvalue =

RE: ResourceOwner refactoring

2021-01-14 Thread kuroda.hay...@fujitsu.com
Hi,

I put some comments.

Throughout, some components don’t have helper functions.
(e.g. catcache has ResOwnerReleaseCatCache, but tupdesc doesn't.)
I think it should be unified.

[catcache.c]
> +/* support for catcache refcount management */
> +static inline void
> +ResourceOwnerEnlargeCatCacheRefs(ResourceOwner owner)
> +{
> +   ResourceOwnerEnlarge(owner);
> +}

This function is not needed.

[syscache.c]
> -static CatCache *SysCache[SysCacheSize];
> + CatCache *SysCache[SysCacheSize];

Is it right? Compilation is done even if this variable is static...

[fd.c, dsm.c]
In these files helper functions are implemented as the define directive.
Could you explain the reason? For the performance?

> Previously, this was OK, because resources AAA and BBB were kept in 
> separate arrays. But after this patch, it's not guaranteed that the 
> ResourceOwnerRememberBBB() will find an empty slot.
>
> I don't think we do that currently, but to be sure, I'm planning to grep 
> ResourceOwnerRemember and look at each call carefullly. And perhaps we 
> can add an assertion for this, although I'm not sure where.

Indeed, but I think this line works well, isn't it?

>   Assert(owner->narr < RESOWNER_ARRAY_SIZE);

Best regards,
Hayato Kuroda
FUJITSU LIMITED



Re: Get memory contexts of an arbitrary backend process

2021-01-14 Thread torikoshia

Since pg_get_target_backend_memory_contexts() waits to dump memory, it could
lead to a deadlock, as below.

  - session1
  BEGIN; TRUNCATE t;

  - session2
  BEGIN; TRUNCATE t; -- wait

  - session1
  SELECT * FROM pg_get_target_backend_memory_contexts(<PID of session 2>); --wait



Thanks for notifying me, Fujii-san.


Attached is the v8 patch, which prohibits calling the function inside a
transaction block.



Regards,

--
Atsushi Torikoshi

From 840185c1ad40cb7bc40333ab38927667c4d48c1d Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi 
Date: Thu, 14 Jan 2021 18:20:43 +0900
Subject: [PATCH v8] After commit 3e98c0bafb28de, we can display the usage of
 the memory contexts using pg_backend_memory_contexts system view. However,
 its target is limited to the process attached to the current session. This
 patch introduces pg_get_target_backend_memory_contexts() and makes it
 possible to collect memory contexts of the specified process.

---
 src/backend/access/transam/xlog.c|   7 +
 src/backend/catalog/system_views.sql |   3 +-
 src/backend/postmaster/pgstat.c  |   3 +
 src/backend/replication/basebackup.c |   3 +
 src/backend/storage/ipc/ipci.c   |   2 +
 src/backend/storage/ipc/procsignal.c |   4 +
 src/backend/storage/lmgr/lwlocknames.txt |   1 +
 src/backend/tcop/postgres.c  |   5 +
 src/backend/utils/adt/mcxtfuncs.c| 742 ++-
 src/backend/utils/init/globals.c |   1 +
 src/bin/initdb/initdb.c  |   3 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |   4 +-
 src/bin/pg_rewind/filemap.c  |   3 +
 src/include/catalog/pg_proc.dat  |  12 +-
 src/include/miscadmin.h  |   1 +
 src/include/pgstat.h |   3 +-
 src/include/storage/procsignal.h |   1 +
 src/include/utils/mcxtfuncs.h|  44 ++
 18 files changed, 821 insertions(+), 21 deletions(-)
 create mode 100644 src/include/utils/mcxtfuncs.h

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b18257c198..45381c343a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -74,6 +74,7 @@
 #include "storage/sync.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
+#include "utils/mcxtfuncs.h"
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
 #include "utils/relmapper.h"
@@ -7009,6 +7010,12 @@ StartupXLOG(void)
 		 */
 		pgstat_reset_all();
 
+		/*
+		 * Reset dump files in pg_memusage, because target processes do
+		 * not exist any more.
+		 */
+		RemoveMemcxtFile(0);
+
 		/*
 		 * If there was a backup label file, it's done its job and the info
 		 * has now been propagated into pg_control.  We must get rid of the
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5d89e77dbe..7419c496b2 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -558,7 +558,8 @@ CREATE VIEW pg_backend_memory_contexts AS
 SELECT * FROM pg_get_backend_memory_contexts();
 
 REVOKE ALL ON pg_backend_memory_contexts FROM PUBLIC;
-REVOKE EXECUTE ON FUNCTION pg_get_backend_memory_contexts() FROM PUBLIC;
+REVOKE EXECUTE ON FUNCTION pg_get_backend_memory_contexts FROM PUBLIC;
+REVOKE EXECUTE ON FUNCTION pg_get_target_backend_memory_contexts FROM PUBLIC;
 
 -- Statistics views
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3f24a33ef1..8eb2d062b0 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4045,6 +4045,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_XACT_GROUP_UPDATE:
 			event_name = "XactGroupUpdate";
 			break;
+		case WAIT_EVENT_DUMP_MEMORY_CONTEXT:
+			event_name = "DumpMemoryContext";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 0f54635550..c67e71d79b 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -184,6 +184,9 @@ static const char *const excludeDirContents[] =
 	/* Contents zeroed on startup, see StartupSUBTRANS(). */
 	"pg_subtrans",
 
+	/* Skip memory context dump files. */
+	"pg_memusage",
+
 	/* end of list */
 	NULL
 };
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index f9bbe97b50..18a1dd5a74 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -45,6 +45,7 @@
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "storage/spin.h"
+#include "utils/mcxtfuncs.h"
 #include "utils/snapmgr.h"
 
 /* GUCs */
@@ -267,6 +268,7 @@ CreateSharedMemoryAndSemaphores(void)
 	BTreeShmemInit();
 	SyncScanShmemInit();
 	AsyncShmemInit();
+	McxtDumpShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 

Re: Yet another fast GiST build

2021-01-14 Thread Andrey Borodin


> On Jan 14, 2021, at 04:47, Peter Geoghegan  wrote:
> 
> On Tue, Jan 12, 2021 at 5:49 AM Heikki Linnakangas  wrote:
>> I did a bit of cleanup on the function signature. The .sql script
>> claimed that gist_page_items() took bytea as argument, but in reality it
>> was a relation name, as text. I changed it so that it takes a page image
>> as argument, instead of reading the block straight from the index.
>> Mainly to make it consistent with brin_page_items(), if it wasn't for
>> that precedence I might've gone either way on it.
> 
> BTW it would be nice if gist_page_items() had a "dead" boolean output
> argument for the item's LP_DEAD bit, just like bt_page_items().
+1. PFA patch.
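
Illustrative usage, assuming the new column from this patch (the index name is
made up):

-- inspect page 1 of a GiST index, including the per-item LP_DEAD flag
SELECT itemoffset, ctid, dead
FROM gist_page_items(get_raw_page('my_gist_idx', 1), 'my_gist_idx'::regclass);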

> I plan
> on adding some testing for GiST's opportunistic index deletion soon. I
> may also add some of the same enhancements that nbtree got today
> (following commit d168b666).
> 
> This feature was originally heavily based on the nbtree LP_DEAD
> deletion mechanism (now called simple deletion), and I see no reason
> (or at least no good reason) why it shouldn't be possible to keep it
> in sync (except maybe with bottom-up deletion, where that it at least
> isn't straightforward/mechanical).

Sound great!

Best regards, Andrey Borodin.


0001-Add-bool-column-for-LP_DEAF-flag-to-GiST-pageinspect.patch
Description: Binary data


Re: remove unneeded pstrdup in fetch_table_list

2021-01-14 Thread Amit Kapila
On Thu, Jan 14, 2021 at 10:51 AM Tom Lane  wrote:
>
> Michael Paquier  writes:
> > On Thu, Jan 14, 2021 at 01:17:57AM +, Hou, Zhijie wrote:
>  Thanks. I am thinking to backpatch this even though there is no
>  problem reported from any production system. What do you think?
>
> > text_to_cstring() indeed allocates a new string, so the extra
> > allocation is useless.  FWIW, I don't see much point in poking at
> > the stable branches here.
>
> Yeah, unless there's some reason to think that this creates a
> meaningful memory leak, I wouldn't bother back-patching.
>

The only case where it might have an impact, as per the report of Zhijie Hou,
is where the user is subscribed to a publication that has a lot of tables,
like Create Publication ... For All Tables. Even though the memory consumed
could be high in such a case, all the memory is allocated in the Portal
context and will be released at statement end. I was not sure whether that
could create a meaningful leak for any user, so to be on the safer side I
proposed to backpatch it. However, if others don't think we need to backpatch
this, then I am fine doing it just for HEAD.

-- 
With Regards,
Amit Kapila.




Re: proposal: schema variables

2021-01-14 Thread Erik Rijkers

On 2021-01-14 07:35, Pavel Stehule wrote:

[schema-variables-20210114.patch.gz]



Build is fine. My (small) list of tests run OK.

I did notice a few more documentation peculiarities:


'The PostgreSQL has schema variables'  should be
'PostgreSQL has schema variables'


A link to the LET command should be added to the 'See Also' of the 
CREATE VARIABLE, ALTER VARIABLE, and DROP VARIABLE pages. (After all, 
the LET command is the most interesting)
Similarly, an ALTER VARIABLE link should be added to the 'See Also' 
section of LET.



Somehow, the sgml in the doc files causes too large spacing in the html, 
example:

I copy from the LET html:
   'if that is defined.  If no explicit'
   (6 spaces between 'defined.' and 'If')
Can you have a look?  Sorry - I can't find the cause.  It occurs on a 
few more places in the newly added sgml/html.  The unwanted spaces are 
visible also in the pdf.

(firefox 78.6.1, debian)


Thanks,

Erik Rijkers






Re: [bug fix] Fix the size calculation for shmem TOC

2021-01-14 Thread Fujii Masao




On 2021/01/14 17:47, tsunakawa.ta...@fujitsu.com wrote:

Hello,


The attached patch fixes a trivial mistake in the calculation of shmem TOC.  
The original code allocates unduly large memory because it adds the result of 
add_size() to its left argument.

Thanks for the report and patch! The patch looks good to me.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




RE: Parallel INSERT (INTO ... SELECT ...)

2021-01-14 Thread Tang, Haiying
Hi Greg, Amit
Cc:hackers

> > > 4. Have you checked the overhead of this on the planner for 
> > > different kinds of statements like inserts into tables having 100 
> > > or 500 partitions? Similarly, it is good to check the overhead of 
> > > domain related checks added in the patch.
> > >
> >
> > Checking that now and will post results soon.
> >
>I am seeing a fair bit of overhead in the planning for the INSERT 
>parallel-safety checks (mind you, compared to the overall performance 
>gain, it's not too bad).

Considering the 'real-world' use cases and the extreme cases I can imagine, I
took 3 kinds of measurements on a partitioned table for the latest patch (V11).
The measurements mainly focus on small row counts because that makes it easier
to evaluate the check overhead relative to the parallelism optimization.
From the current results, the overhead looks acceptable compared to the
benefits, as Greg said.

Test 1: overhead of parallel insert into thousands of partitions, 1 row per
partition.
%reg = (patched - master) / master
all time = Execution Time + Planning Time

partitions | Execution Time(ms) | Planning Time(ms) | Execution Time(ms) | Planning Time(ms) | %reg(Execution Time) | %reg(all time)
           | (patched)          | (patched)         | (master)           | (master)          |                      |
-----------|--------------------|-------------------|--------------------|-------------------|----------------------|---------------
1000       | 2281.291           | 25.983            | 9752.145           | 0.208             | -77%                 | -76%
2000       | 2303.229           | 50.427            | 9446.221           | 0.227             | -76%                 | -75%
4000       | 2303.207           | 100.946           | 9948.743           | 0.211             | -77%                 | -76%
6000       | 2411.877           | 152.212           | 9953.114           | 0.210             | -76%                 | -74%
10000      | 2467.235           | 260.751           | 10917.494          | 0.284             | -77%                 | -75%

Test 2: overhead of parallel insert into thousands of partitions, 100 rows per
partition.

partitions | Execution Time(ms) | Planning Time(ms) | Execution Time(ms) | Planning Time(ms) | %reg(Execution Time) | %reg(all time)
           | (patched)          | (patched)         | (master)           | (master)          |                      |
-----------|--------------------|-------------------|--------------------|-------------------|----------------------|---------------
1000       | 2366.620           | 25.787            | 14052.748          | 0.238             | -83%                 | -83%
2000       | 2325.171           | 48.780            | 10099.203          | 0.211             | -77%                 | -76%
4000       | 2599.344           | 110.978           | 10678.065          | 0.216             | -76%                 | -75%
6000       | 2764.070           | 152.929           | 10880.948          | 0.238             | -75%                 | -73%
10000      | 3043.658           | 265.297           | 11607.202          | 0.207             | -74%                 | -71%

Test 3: overhead of parallel insert into varying numbers of partitions and
inserted rows.

partitions | total table rows | Execution Time(ms) | Planning Time(ms) | Execution Time(ms) | Planning Time(ms) | %reg(Execution Time) | %reg(all time)
           |                  | (patched)          | (patched)         | (master)           | (master)          |                      |
-----------|------------------|--------------------|-------------------|--------------------|-------------------|----------------------|---------------
100        | 1000             | 11202.021          | 1.593             | 25668.560          | 0.212             | -56%                 | -56%
500        | 1000             | 10290.368          | 12.722            | 25730.860          | 0.214             | -60%                 | -60%
1000       | 1000             | 8946.627           | 24.851            | 26271.026          | 0.219             | -66%                 | -66%
2000       | 1000             | 10615.643          | 50.111            | 25512.692          | 0.231             | -58%                 | -58%
4000       | 1000             | 9056.334           | 105.644           | 26643.383          | 0.217

Re: POC: postgres_fdw insert batching

2021-01-14 Thread Amit Langote
Hi,

On Thu, Jan 14, 2021 at 2:41 AM Tomas Vondra
 wrote:
> On 1/13/21 3:43 PM, Tomas Vondra wrote:
> > Thanks for the report. Yeah, I think there's a missing check in
> > ExecInsert. Adding
> >
> >   (!resultRelInfo->ri_TrigDesc->trig_insert_after_row)
> >
> > solves this. But now I'm wondering if this is the wrong place to make
> > this decision. I mean, why should we make the decision here, when the
> > decision whether to have a RETURNING clause is made in postgres_fdw in
> > deparseReturningList? We don't really know what the other FDWs will do,
> > for example.
> >
> > So I think we should just move all of this into GetModifyBatchSize. We
> > can start with ri_BatchSize = 0. And then do
> >
> >   if (resultRelInfo->ri_BatchSize == 0)
> >     resultRelInfo->ri_BatchSize =
> >       resultRelInfo->ri_FdwRoutine->GetModifyBatchSize(resultRelInfo);
> >
> >   if (resultRelInfo->ri_BatchSize > 1)
> >   {
> >     ... do batching ...
> >   }
> >
> > The GetModifyBatchSize would always return value > 0, so either 1 (no
> > batching) or >1 (batching).
> >
>
> FWIW the attached v8 patch does this - most of the conditions are moved
> to the GetModifyBatchSize() callback.

Thanks.  A few comments:

* I agree with leaving it up to an FDW to look at the properties of
the table and of the operation being performed to decide whether or
not to use batching, although maybe BeginForeignModify() is a better
place for putting that logic instead of GetModifyBatchSize()?  So, in
create_foreign_modify(), instead of PgFdwModifyState.batch_size simply
being set to match the table's or the server's value for the
batch_size option, make it also consider the things that prevent
batching and set the execution state's batch_size based on that.
GetModifyBatchSize() simply returns that value.

* Regarding the timing of calling GetModifyBatchSize() to set
ri_BatchSize, I wonder if it wouldn't be better to call it just once,
say from ExecInitModifyTable(), right after BeginForeignModify()
returns?  I don't quite understand why it is being called from
ExecInsert().  Can the batch size change once the execution starts?

* Lastly, how about calling it GetForeignModifyBatchSize() to be
consistent with other nearby callbacks?
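
On the first point, a minimal sketch of that division of labour could look
like the following (purely illustrative: the hypothetical helper name
decide_batch_size, the exact batching conditions, and the
PgFdwModifyState.batch_size field are taken from or assumed based on this
discussion, not from any committed code).  create_foreign_modify() would
call the helper once, and the callback would merely report the stored value:

/* hypothetical helper, called from create_foreign_modify() */
static void
decide_batch_size(PgFdwModifyState *fmstate,
                  ResultRelInfo *resultRelInfo,
                  List *returningList,
                  int batch_size_option)
{
    /* conditions assumed to prevent batching, per the discussion up-thread */
    if (returningList != NIL ||
        (resultRelInfo->ri_TrigDesc &&
         resultRelInfo->ri_TrigDesc->trig_insert_after_row))
        fmstate->batch_size = 1;    /* fall back to one row at a time */
    else
        fmstate->batch_size = Max(batch_size_option, 1);
}

/* the callback then has nothing left to compute */
static int
postgresGetForeignModifyBatchSize(ResultRelInfo *resultRelInfo)
{
    PgFdwModifyState *fmstate = (PgFdwModifyState *) resultRelInfo->ri_FdwState;

    return fmstate->batch_size;
}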

> I've removed the check for the
> BatchInsert callback, though - the FDW knows whether it supports that,
> and it seems a bit pointless at the moment as there are no other batch
> callbacks. Maybe we should add an Assert somewhere, though?

Hmm, not checking whether BatchInsert() exists may not be a good idea,
because if an FDW's GetModifyBatchSize() returns a value > 1 but
there's no BatchInsert() function to call, ExecBatchInsert() would
trip.  I don't see the newly added documentation telling FDW authors
to either define both or none.
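
A defensive check along those lines could be as simple as the following
sketch (the callback names are the ones discussed in this thread, so treat
them as assumptions rather than the final API):

    if (resultRelInfo->ri_BatchSize > 1)
    {
        /* an FDW that asks for batching must also provide the batch callback */
        Assert(resultRelInfo->ri_FdwRoutine->ExecForeignBatchInsert != NULL);
        /* ... accumulate slots and flush them via ExecBatchInsert() ... */
    }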

Regarding how this plays with partitions, I don't think we need
ExecGetTouchedPartitions(), because you can get the routed-to
partitions using es_tuple_routing_result_relations.  Also, perhaps
it's a good idea to put the "finishing" ExecBatchInsert() calls into a
function ExecFinishBatchInsert().  Maybe the logic to choose the
relations to perform the finishing calls on will get complicated in
the future as batching is added for updates/deletes too and it seems
better to encapsulate that in the separate function than have it out
in the open in ExecModifyTable().
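
To make that concrete, a rough sketch of such a function might look like
this (heavily hedged: the ri_Slots/ri_PlanSlots/ri_NumSlots fields and the
ExecBatchInsert() signature are assumptions about how the patch tracks
pending tuples, not a reading of its actual code):

static void
ExecFinishBatchInsert(ModifyTableState *mtstate, EState *estate)
{
    ListCell   *lc;

    /* flush tuples still queued for each routed-to partition */
    foreach(lc, estate->es_tuple_routing_result_relations)
    {
        ResultRelInfo *resultRelInfo = (ResultRelInfo *) lfirst(lc);

        if (resultRelInfo->ri_NumSlots > 0)
            ExecBatchInsert(mtstate, resultRelInfo,
                            resultRelInfo->ri_Slots,
                            resultRelInfo->ri_PlanSlots,
                            resultRelInfo->ri_NumSlots,
                            estate, mtstate->canSetTag);
    }
}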

(Sorry about being so late reviewing this.)

--
Amit Langote
EDB: http://www.enterprisedb.com




[bug fix] Fix the size calculation for shmem TOC

2021-01-14 Thread tsunakawa.ta...@fujitsu.com
Hello,


The attached patch fixes a trivial mistake in the calculation of shmem TOC.  
The original code allocates unduly large memory because it adds the result of 
add_size() to its left argument.
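
To illustrate the pattern being fixed (hypothetical variables, not the
actual hunk): add_size() is the overflow-checked adder from shmem.c and
already returns the new running total, so adding its result back onto the
accumulator double-counts it once the total is non-zero.

    Size    size = 512;     /* running total so far */
    Size    extra = 128;    /* next chunk to account for */

    /* wrong form: would yield 512 + (512 + 128) = 1152 */
    /* size += add_size(size, extra); */

    /* intended form: add_size() already returns the new total, 640 */
    size = add_size(size, extra);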


Regards
Takayuki Tsunakawa



0001-Fix-size-calculation-for-shmem-TOC.patch
Description: 0001-Fix-size-calculation-for-shmem-TOC.patch


Re: [PATCH] Identify LWLocks in tracepoints

2021-01-14 Thread Craig Ringer
On Thu, 14 Jan 2021 at 15:56, Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> On 2020-12-19 06:00, Craig Ringer wrote:
> > Patch 1 fixes a bogus tracepoint where an lwlock__acquire event would be
> > fired from LWLockWaitForVar, despite that function never actually
> > acquiring the lock.
>
> This was added in 68a2e52bbaf when LWLockWaitForVar() was first
> introduced.  It looks like a mistake to me too, but maybe Heikki wants
> to comment.
>

I'm certain it's a copy/paste bug.


Re: [PATCH] Identify LWLocks in tracepoints

2021-01-14 Thread Craig Ringer
On Wed, 13 Jan 2021 at 19:19, Dmitry Dolgov <9erthali...@gmail.com> wrote:

> > On Sat, Dec 19, 2020 at 01:00:01PM +0800, Craig Ringer wrote:
> >
> > The attached patch set follows on from the discussion in [1] "Add LWLock
> > blocker(s) information" by adding the actual LWLock* and the numeric
> > tranche ID to each LWLock related TRACE_POSTGRESQL_foo tracepoint.
> >
> > This does not provide complete information on blockers, because it's not
> > necessarily valid to compare any two LWLock* pointers between two process
> > address spaces. The locks could be in DSM segments, and those DSM
> segments
> > could be mapped at different addresses.
> >
> > I wasn't able to work out a sensible way to map a LWLock* to any sort of
> > (tranche-id, lock-index) because there's no requirement that locks in a
> > tranche be contiguous or known individually to the lmgr.
> >
> > Despite that, the patches improve the information available for LWLock
> > analysis significantly.
>
> Thanks for the patches, this could be indeed useful. I've looked through
> and haven't noticed any issues with either the tracepoint extensions or
> commentaries, except that I find it is not that clear how trance_id
> indicates a re-initialization here?
>
> /* Re-initialization of individual LWLocks is not permitted */
> Assert(tranche_id >= NUM_INDIVIDUAL_LWLOCKS || !IsUnderPostmaster);
>

There should be no reason for anything to call LWLockInitialize(...) on an
individual LWLock, since they are all initialized during postmaster startup.

Doing so must be a bug.

But that's a trivial change that can be done separately.


> > Patch 2 adds the tranche id and lock pointer for each trace hit. This
> makes
> > it possible to differentiate between individual locks within a tranche,
> and
> > (so long as they aren't tranches in a DSM segment) compare locks between
> > processes. That means you can do lock-order analysis etc, which was not
> > previously especially feasible.
>
> I'm curious in which kind of situations lock-order analysis could be
> helpful?
>

If code-path 1 does

LWLockAcquire(LockA, LW_EXCLUSIVE);
...
LWLockAcquire(LockB, LW_EXCLUSIVE);

and code-path 2 does:

LWLockAcquire(LockB, LW_EXCLUSIVE);
...
LWLockAcquire(LockA, LW_EXCLUSIVE);

then they're subject to deadlock. But you might not actually hit that often
in test workloads if the timing window required for the deadlock is tight
and/or the vulnerable path only occurs on infrequent operations.

It's not always easy to reason about or prove things about lock order when
they're potentially nested deep within many layers of other calls and
callbacks. Obviously something we try to avoid with LWLocks, but not
impossible.

If you trace a workload and derive all possible nestings of lock acquire
order, you can then prove things about whether there are any possible
ordering conflicts and where they might arise.

A PoC to do so is on my TODO.
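
For illustration, the core of such a PoC can be a plain cycle check over
the observed "held -> then acquired" pairs of tranche IDs. This toy
standalone program (the names and the fixed MAX_TRANCHES bound are mine,
not anything in the tree) shows the idea:

#include <stdbool.h>
#include <stdio.h>

#define MAX_TRANCHES 256

static bool edge[MAX_TRANCHES][MAX_TRANCHES];   /* edge[a][b]: b acquired while a held */
static int  state[MAX_TRANCHES];                /* 0 = unvisited, 1 = on stack, 2 = done */

static bool
has_cycle_from(int n)
{
    state[n] = 1;
    for (int m = 0; m < MAX_TRANCHES; m++)
    {
        if (!edge[n][m])
            continue;
        if (state[m] == 1)
            return true;        /* back edge: conflicting acquire order */
        if (state[m] == 0 && has_cycle_from(m))
            return true;
    }
    state[n] = 2;
    return false;
}

int
main(void)
{
    /* e.g. the trace showed 1-before-2 on one path and 2-before-1 on another */
    edge[1][2] = true;
    edge[2][1] = true;

    for (int n = 0; n < MAX_TRANCHES; n++)
        if (state[n] == 0 && has_cycle_from(n))
            printf("possible lock ordering conflict reachable from tranche %d\n", n);
    return 0;
}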

> > Traces also don't have to do userspace reads for the tranche name all
> > the time, so the trace can run with lower overhead.
>
> This one is also interesting. Just for me to clarify, wouldn't there be
> a bit of overhead anyway (due to switching from kernel context to user
> space when a tracepoint was hit) that will mask name read overhead? Or
> are there any available numbers about it?
>

I don't have numbers on that. Whether it matters will depend way too much
on how you're using the probe points and collecting/consuming the data
anyway.

It's a bit unfortunate (IMO) that we make a function call for each
tracepoint invocation to get the tranche names. Ideally I'd prefer to be
able to omit the tranche names lookups for these probes entirely for
something as hot as LWLocks. But it's a bit of a pain to look up the
tranche names from an external trace tool, so instead I'm inclined to see
if we can enable systemtap's semaphores and only compute the tranche name
if the target probe is actually enabled. But that'd be separate to this
patch and require a build change in how systemtap support is compiled and
linked.
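
The shape of that would be something like the following (an assumption
about how the probes get built, not what the tree does today: with SDT
semaphores enabled, dtrace -h emits a FOO_ENABLED() macro next to each
probe macro), so the tranche-name lookup is only paid when someone is
actually attached to the probe:

    if (TRACE_POSTGRESQL_LWLOCK_ACQUIRE_ENABLED())
        TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(lock), mode);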

BTW, a user->kernel->user context switch only occurs when the trace tool's
probes use kernel space - such as for perf based probes, or for systemtap's
kernel-runtime probes. The same markers can be used by e.g. systemtap's
"dyninst" runtime that runs entirely in userspace.


Re: Protect syscache from bloating with negative cache entries

2021-01-14 Thread Kyotaro Horiguchi
Hello.

The commit 4656e3d668 (debug_invalidate_system_caches_always)
conflicted with this patch. Rebased.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From ec069488fd2675369530f3f967f02a7b683f0a7f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 18 Nov 2020 16:54:31 +0900
Subject: [PATCH v6 1/3] CatCache expiration feature

---
 src/backend/access/transam/xact.c  |  3 ++
 src/backend/utils/cache/catcache.c | 87 +-
 src/backend/utils/misc/guc.c   | 12 +
 src/include/utils/catcache.h   | 19 +++
 4 files changed, 120 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a2068e3fd4..86888d2409 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1086,6 +1086,9 @@ static void
 AtStart_Cache(void)
 {
 	AcceptInvalidationMessages();
+
+	if (xactStartTimestamp != 0)
+		SetCatCacheClock(xactStartTimestamp);
 }
 
 /*
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index fa2b49c676..644d92dd9a 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -38,6 +38,7 @@
 #include "utils/rel.h"
 #include "utils/resowner_private.h"
 #include "utils/syscache.h"
+#include "utils/timestamp.h"
 
 
  /* #define CACHEDEBUG */	/* turns DEBUG elogs on */
@@ -60,9 +61,19 @@
 #define CACHE_elog(...)
 #endif
 
+/*
+ * GUC variable to define the minimum age of entries that will be considered
+ * to be evicted in seconds. -1 to disable the feature.
+ */
+int catalog_cache_prune_min_age = -1;
+uint64	prune_min_age_us;
+
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+/* Clock for the last accessed time of a catcache entry. */
+uint64	catcacheclock = 0;
+
 static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
 			   int nkeys,
 			   Datum v1, Datum v2,
@@ -74,6 +85,7 @@ static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
 Index hashIndex,
 Datum v1, Datum v2,
 Datum v3, Datum v4);
+static bool CatCacheCleanupOldEntries(CatCache *cp);
 
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
 		   Datum v1, Datum v2, Datum v3, Datum v4);
@@ -99,6 +111,15 @@ static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
 			 Datum *srckeys, Datum *dstkeys);
 
+/* GUC assign function */
+void
+assign_catalog_cache_prune_min_age(int newval, void *extra)
+{
+	if (newval < 0)
+		prune_min_age_us = UINT64_MAX;
+	else
+		prune_min_age_us = ((uint64) newval) * USECS_PER_SEC;
+}
 
 /*
  *	internal support functions
@@ -1264,6 +1285,9 @@ SearchCatCacheInternal(CatCache *cache,
 		 */
 		dlist_move_head(bucket, &ct->cache_elem);
 
+		/* Record the last access timestamp */
+		ct->lastaccess = catcacheclock;
+
 		/*
 		 * If it's a positive entry, bump its refcount and return it. If it's
 		 * negative, we can report failure to the caller.
@@ -1425,6 +1449,61 @@ SearchCatCacheMiss(CatCache *cache,
 	return &ct->tuple;
 }
 
+/*
+ * CatCacheCleanupOldEntries - Remove infrequently-used entries
+ *
+ * Catcache entries can be left unused for a long time for several reasons.
+ * Remove such entries to prevent the catcache from bloating. The approach is
+ * similar to the buffer-eviction algorithm: entries that are accessed several
+ * times within a certain period live longer than those accessed less often
+ * over the same duration.
+ */
+static bool
+CatCacheCleanupOldEntries(CatCache *cp)
+{
+	int		nremoved = 0;
+	int		i;
+	long	oldest_ts = catcacheclock;
+	uint64	prune_threshold = catcacheclock - prune_min_age_us;
+
+	/* Scan over the whole hash to find entries to remove */
+	for (i = 0 ; i < cp->cc_nbuckets ; i++)
+	{
+		dlist_mutable_iter	iter;
+
+		dlist_foreach_modify(iter, &cp->cc_bucket[i])
+		{
+			CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+			/* Don't remove referenced entries */
+			if (ct->refcount == 0 &&
+				(ct->c_list == NULL || ct->c_list->refcount == 0))
+			{
+				if (ct->lastaccess < prune_threshold)
+				{
+					CatCacheRemoveCTup(cp, ct);
+					nremoved++;
+
+					/* don't let the removed entry update oldest_ts */
+					continue;
+				}
+			}
+
+			/* update the oldest timestamp if the entry remains alive */
+			if (ct->lastaccess < oldest_ts)
+				oldest_ts = ct->lastaccess;
+		}
+	}
+
+	cp->cc_oldest_ts = oldest_ts;
+
+	if (nremoved > 0)
+		elog(DEBUG1, "pruning catalog cache id=%d for %s: removed %d / %d",
+			 cp->id, cp->cc_relname, nremoved, cp->cc_ntup + nremoved);
+
+	return nremoved > 0;
+}
+
 /*
  *	ReleaseCatCache
  *
@@ -1888,6 +1967,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 	ct->dead = false;
 	ct->negative = negative;
 	ct->hash_value = hashValue;
+	ct->lastaccess = 

Re: In-placre persistance change of a relation

2021-01-14 Thread Kyotaro Horiguchi
At Tue, 12 Jan 2021 18:58:08 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Fri, 08 Jan 2021 17:52:21 +0900 (JST), Kyotaro Horiguchi 
>  wrote in 
> > At Fri, 08 Jan 2021 14:47:05 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in 
> > > This version RelationChangePersistence() is changed not to choose
> > > in-place method for indexes other than btree. It seems to be usable
> > > with all kind of indexes other than Gist, but at the mement it applies
> > > only to btrees.
> > > 
> > > 1: 
> > > https://www.postgresql.org/message-id/ca+tgmozez5rons49c7mepjhjndqmqtvrz_lcqukprwdmrev...@mail.gmail.com
> > 
> > Hmm. This is not working correctly. I'll repost after fixing that.
> 
> I think I fixed the misbehavior. ResetUnloggedRelationsInDbspaceDir()
> handles file operations in the wrong order and with the wrong logic.
> It also needed to drop buffers and forget fsync requests.
> 
> I thought that the two cases that this patch is expected to fix
> (orphan relation files and uncommited init files) can share the same
> "cleanup" fork but that is wrong. I had to add one more additional
> fork to differentiate the cases of SET UNLOGGED and of creation of
> UNLOGGED tables...
> 
> The attached is a new version, that seems working correctly but looks
> somewhat messy. I'll continue working.

Commit bea449c635 conflicts with this patch due to the change in the
definition of DropRelFileNodeBuffers. That change simplified this patch a bit :p

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 5f785f181acdac18952f504ec45ce41f285c05bc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 11 Nov 2020 21:51:11 +0900
Subject: [PATCH v5 1/2] In-place table persistence change

Even though ALTER TABLE SET LOGGED/UNLOGGED does not require data
rewriting, it currently performs a heap rewrite, which causes a large
amount of file I/O.  This patch makes the command run without a heap
rewrite.  In addition, SET LOGGED while wal_level > minimal emits WAL
using XLOG_FPI instead of a massive number of HEAP_INSERTs, which
should be smaller.

This also allows cleanup of files left behind by a crash of the
transaction that created them.
---
 src/backend/access/rmgrdesc/smgrdesc.c |  23 ++
 src/backend/access/transam/README  |   8 +
 src/backend/access/transam/xlog.c  |  17 +
 src/backend/catalog/storage.c  | 436 +++--
 src/backend/commands/tablecmds.c   | 246 +++---
 src/backend/storage/buffer/bufmgr.c|  88 +
 src/backend/storage/file/reinit.c  | 316 --
 src/backend/storage/smgr/md.c  |  13 +-
 src/backend/storage/smgr/smgr.c|   6 +
 src/common/relpath.c   |   4 +-
 src/include/catalog/storage.h  |   2 +
 src/include/catalog/storage_xlog.h |  22 +-
 src/include/common/relpath.h   |   6 +-
 src/include/storage/bufmgr.h   |   2 +
 src/include/storage/md.h   |   2 +
 src/include/storage/reinit.h   |   3 +-
 src/include/storage/smgr.h |   1 +
 17 files changed, 1028 insertions(+), 167 deletions(-)

diff --git a/src/backend/access/rmgrdesc/smgrdesc.c b/src/backend/access/rmgrdesc/smgrdesc.c
index 773d57..2c109b8ca4 100644
--- a/src/backend/access/rmgrdesc/smgrdesc.c
+++ b/src/backend/access/rmgrdesc/smgrdesc.c
@@ -40,6 +40,23 @@ smgr_desc(StringInfo buf, XLogReaderState *record)
 		 xlrec->blkno, xlrec->flags);
 		pfree(path);
 	}
+	else if (info == XLOG_SMGR_UNLINK)
+	{
+		xl_smgr_unlink *xlrec = (xl_smgr_unlink *) rec;
+		char	   *path = relpathperm(xlrec->rnode, xlrec->forkNum);
+
+		appendStringInfoString(buf, path);
+		pfree(path);
+	}
+	else if (info == XLOG_SMGR_BUFPERSISTENCE)
+	{
+		xl_smgr_bufpersistence *xlrec = (xl_smgr_bufpersistence *) rec;
+		char	   *path = relpathperm(xlrec->rnode, MAIN_FORKNUM);
+
+		appendStringInfoString(buf, path);
+		appendStringInfo(buf, " persistence %d", xlrec->persistence);
+		pfree(path);
+	}
 }
 
 const char *
@@ -55,6 +72,12 @@ smgr_identify(uint8 info)
 		case XLOG_SMGR_TRUNCATE:
 			id = "TRUNCATE";
 			break;
+		case XLOG_SMGR_UNLINK:
+			id = "UNLINK";
+			break;
+		case XLOG_SMGR_BUFPERSISTENCE:
+			id = "BUFPERSISTENCE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index 1edc8180c1..547107a771 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -724,6 +724,14 @@ we must panic and abort recovery.  The DBA will have to manually clean up
 then restart recovery.  This is part of the reason for not writing a WAL
 entry until we've successfully done the original action.
 
+The CLEANUP fork file
+
+
+A CLEANUP fork is created when a new relation file is created, to mark
+that the relfilenode needs to be cleaned up at recovery time.  In contrast
+to 4 above, failure to remove a CLEANUP fork file will lead to data
+loss, in which case the server will shut down.
+
 
 Skipping WAL for 

Re: Improper use about DatumGetInt32

2021-01-14 Thread Michael Paquier
On Wed, Jan 13, 2021 at 09:27:37AM +0100, Peter Eisentraut wrote:
> Interesting idea.  Here is a patch that incorporates that.

Thanks for adding some coverage.

This patch needs a small rebase as Heikki has just introduced some
functions for gist, bumping the module to 1.9 (no need to bump to
1.10, right?).

I don't have any more comments from reading the code, and my tests have
passed after applying the patch on top of df10ac62.  I would have also
added some tests that check for blkno < 0 and > MaxBlockNumber in
all the functions where that can be triggered, as that's cheap for 1.8
and 1.9, but that's a minor point.
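
For reference, the kind of range check those extra tests would exercise is
roughly the following (an illustrative function body, not the module's
actual code):

Datum
hypothetical_page_func(PG_FUNCTION_ARGS)
{
    int64       blkno = PG_GETARG_INT64(1);

    if (blkno < 0 || blkno > MaxBlockNumber)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("invalid block number")));

    /* ... the real function would go on to read the requested block ... */
    PG_RETURN_NULL();
}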
--
Michael


signature.asc
Description: PGP signature