Re: A qsort template

2021-06-15 Thread Thomas Munro
On Mon, Mar 15, 2021 at 1:09 PM Thomas Munro  wrote:
> On Sun, Mar 14, 2021 at 5:03 PM Zhihong Yu  wrote:
> > + * Remove duplicates from an array.  Return the new size.
> > + */
> > +ST_SCOPE size_t
> > +ST_UNIQUE(ST_ELEMENT_TYPE *array,
> >
> > The array is supposed to be sorted, right ?
> > The comment should mention this.
>
> Good point, will update.  Thanks!

Rebased.  Also fixed some formatting problems and updated
typedefs.list so they don't come back.
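
(For readers skimming the archive: the kind of pass a generated ST_UNIQUE routine makes over an already-sorted array is roughly the plain-C sketch below; the function name and element type are placeholders, not the template code from the patch.)

#include <stddef.h>

/*
 * Illustrative only: squeeze out adjacent duplicates from a sorted array
 * and return the new length.  The template in the patch generates a
 * specialized equivalent per element type and comparator.
 */
static size_t
unique_ints(int *array, size_t n)
{
    size_t      w = 0;          /* index of last kept element */
    size_t      r;

    if (n == 0)
        return 0;
    for (r = 1; r < n; r++)
    {
        if (array[r] != array[w])
            array[++w] = array[r];
    }
    return w + 1;
}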


more-sort-search-specializations-v2.tgz
Description: application/compressed-tar


Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 9:59 PM Noah Misch  wrote:
> Hackers are rather wise, but the variety of PostgreSQL use is enormous.  We
> see that, among other ways, when regression fixes spike in each vN.1.  The
> $SUBJECT feature was born in response to a user experience; a lack of hacker
> interest doesn't invalidate that user experience.

I agree that it would be good to hear from some users about this. If a
less painful workaround is possible at all, then users may be able to
help -- maybe it'll be possible to cut scope.

> We face these competing
> interests, at least:

> 1) Some users want the feature kept so their application can use a certain
>pattern of long-running, snapshot-bearing transactions.

Undoubtedly true.

> 2) (a) Some hackers want the feature gone so they can implement changes
>without making those changes cooperate with this feature.  (b) Bugs in this
>feature make such cooperation materially harder.

Is that really true? Though it was probably true back when this thread
was started last year, things have changed. Andres found a way to work
around the problems he had with snapshot too old, AFAIK.

> A hacker adopting the feature would be aiming to reduce (2)(b) to zero,
> essentially.  What other interests are relevant?

The code simply isn't up to snuff. If the code was in a niche contrib
module then maybe it would be okay to let this slide. But the fact is
that it touches critical parts of the system. This cannot be allowed
to drag on forever. It's as simple as that.

I admit that I think that the most likely outcome is that it gets
reverted. I don't feel great about that. What else can be done about
it that will really help the situation? No qualified person is likely
to have the time to commit to fixing snapshot too old. Isn't that the
real problem here?

-- 
Peter Geoghegan




Re: Duplicate history file?

2021-06-15 Thread Julien Rouhaud
On Wed, Jun 16, 2021 at 01:17:11AM -0400, Stephen Frost wrote:
> 
> The archive command is technically invoked using the shell, but the
> interpretation of the exit code, for example, is only discussed in the C
> code, and it’s far from the only consideration that someone developing an
> archive command needs to understand.

The expectations for the return code are documented.  There are some subtleties
for when the command is interrupted by a signal, which are also documented.
Why shouldn't we document the rest of the expectations for what such a command
should do?

> The notion that an archive command can be distanced from backups is really
> not reasonable in my opinion.

As far as I know you can use archiving for replication purposes only.  In such
a case you will certainly have a different usage of the archived files compared
to backups, and different verifications.  But the requirements on what makes an
archive_command safe won't change.

> > > Consider that, really, an archive command should refuse to allow archiving
> > > of WAL on a timeline which doesn’t have a corresponding history file in the
> > > archive for that timeline (excluding timeline 1).
> >
> > Yes, that's a clear requirement that should be documented.
> 
> 
> Is it a clear requirement that pgbackrest and every other organization that
> has developed an archive command has missed? Are you able to point to a
> tool which has such a check today?

I don't know, as I don't have any knowledge of what barman, BART, pgbackrest,
pg_probackup or any other backup solution does in any detail. I was only saying
that what you said makes sense and should be part of the documentation,
assuming that this is indeed a requirement.

> > > Also, a backup tool
> > > should compare the result of pg_start_backup to what’s in the control file,
> > > using a fresh read, after start backup returns to make sure that the
> > > storage is sane and not, say, cache’ing pages independently (such as might
> > > happen with a separate NFS mount..).  Oh, and if a replica is involved, a
> > > check should be done to see if the replica has changed timelines and an
> > > appropriate message thrown if that happens complaining that the backup was
> > > aborted due to the promotion of the replica…
> >
> > I agree, but unless I'm missing something it's unrelated to what an
> > archive_command should be in charge of?
> 
> I’m certainly not moved by this argument as it seems to be willfully
> missing the point.  Further, if we are going to claim that we must document
> archive command to such level then surely we need to also document all the
> requirements for pg_start_backup and pg_stop_backup too, so this strikes me
> as entirely relevant.

So what was the point?  I'm not saying that doing backups is trivial and/or
should not be properly documented, nor that we shouldn't improve the
pg_start_backup or pg_stop_backup documentation; I'm just saying that those
don't change what makes an archive_command safe.




Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-06-15 Thread Amit Kapila
On Tue, Jun 15, 2021 at 8:11 PM Amit Kapila  wrote:
>
> On Mon, Jun 14, 2021 at 9:08 PM Robert Haas  wrote:
> >
> > On Mon, Jun 14, 2021 at 2:32 AM Amit Kapila  wrote:
> >
> > > Yeah, this could be one idea but I think even if we use pg_proc OID,
> > > we still need to check all the rel cache entries to find which one
> > > contains the invalidated OID and that could be expensive. I wonder
> > > can't we directly find the relation involved and register invalidation
> > > for the same? We are able to find the relation to which trigger
> > > function is associated during drop function via findDependentObjects
> > > by scanning pg_depend. Assuming, we are able to find the relation for
> > > trigger function by scanning pg_depend, what kinds of problems do we
> > > envision in registering the invalidation for the same?
> >
> > I don't think that finding the relation involved and registering an
> > invalidation for the same will work properly. Suppose there is a
> > concurrently-running transaction which has created a new table and
> > attached a trigger function to it. You can't see any of the catalog
> > entries for that relation yet, so you can't invalidate it, but
> > invalidation needs to happen. Even if you used some snapshot that can
> > see those catalog entries before they are committed, I doubt it fixes
> > the race condition. You can't hold any lock on that relation, because
> > the creating transaction holds AccessExclusiveLock, but the whole
> > invalidation mechanism is built around the assumption that the sender
> > puts messages into the shared queue first and then releases locks,
> > while the receiver first acquires a conflicting lock and then
> > processes messages from the queue.
> >
>
> Won't such messages be processed at start transaction time
> (AtStart_Cache->AcceptInvalidationMessages)?
>

Even if we accept invalidations at transaction start time, we need to
accept and execute them after taking a lock on the relation to ensure that
the relation doesn't change afterward. I think what I mentioned doesn't
break this assumption because, after finding the relation, we will take a
lock on it before registering the invalidation, so in the above
scenario, it should wait before registering the invalidation.

-- 
With Regards,
Amit Kapila.




Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-06-15 Thread Amit Kapila
On Tue, Jun 15, 2021 at 7:31 PM Robert Haas  wrote:
>
> On Tue, Jun 15, 2021 at 7:05 AM Amit Kapila  wrote:
> > Okay, but I think if we go with your suggested model where whenever
> > there is a change in parallel-safety of any function, we need to send
> > the new invalidation then I think it won't matter whether the function
> > is associated with index expression, check constraint in the table, or
> > is used in any other way.
>
> No, it will still matter, because I'm proposing that the
> parallel-safety of functions that we only access via operator classes
> will not even be checked.
>

I am not very clear on what exactly you have in mind in this
regard. I understand that while computing parallel-safety for a rel we
don't need to consider functions that we only access via an operator
class, but how do we distinguish such functions during ALTER FUNCTION?
Is there a simple way to deduce that a function is an operator class
function, so that we don't register an invalidation for it? Should we
check it via pg_depend?

>
> > We can additionally check the
> > parallel safety of partitions when we are trying to insert into a
> > particular partition and error out if we detect any parallel-unsafe
> > clause and we are in parallel-mode. So, in this case, we won't be
> > completely relying on the users. Users can either change the parallel
> > safe option of the table or remove/change the parallel-unsafe clause
> > after error. The new invalidation message as we are discussing would
> > invalidate the parallel-safety for individual partitions but not the
> > root partition (partitioned table itself). For root partition, we will
> > rely on information specified by the user.
>
> Yeah, that may be the best we can do. Just to be clear, I think we
> would want to check whether the relation is still parallel-safe at the
> start of the operation, but not have a run-time check at each function
> call.
>

Agreed, that is what I also had in mind.

> > I am not sure if we have a simple way to check the parallel safety of
> > partitioned tables. In some way, we need to rely on user either (a) by
> > providing an option to specify whether parallel Inserts (and/or other
> > DMLs) can be performed, or (b) by providing a guc and/or rel option
> > which indicate that we can check the parallel-safety of all the
> > partitions. Yet another option that I don't like could be to
> > parallelize inserts on non-partitioned tables.
>
> If we figure out a way to check the partitions automatically that
> actually works, we don't need a switch for it; we can (and should)
> just do it that way all the time. But if we can't come up with a
> correct algorithm for that, then we'll need to add some kind of option
> where the user declares whether it's OK.
>

Yeah, so let us think about it for some more time and see if we can come up
with something better for partitions; otherwise, we can sort things out
further in this direction.

-- 
With Regards,
Amit Kapila.




Re: Race condition in recovery?

2021-06-15 Thread Kyotaro Horiguchi
At Tue, 15 Jun 2021 07:54:49 -0400, Andrew Dunstan  wrote in 
> 
> On 6/15/21 2:16 AM, Kyotaro Horiguchi wrote:
> > At Fri, 11 Jun 2021 10:46:45 -0400, Tom Lane  wrote in 
> >> I think jacana uses msys[2?], so this likely indicates a problem
> >> in path sanitization for the archive command.  Andrew, any advice?
> > Thanks for fixing it.
> >
> > # I still haven't succeeded in running the TAP tests on an MSYS2
> > # environment. I cannot install IPC::Run for msys perl..
> >
> > regards.
> >
> 
> 
> Unpack the attached somewhere and point your PERL5LIB at it. That's all
> I do.

Thanks a lot, Andrew.  I got to run the TAP test with it and saw the
same error as jacana.
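
(For the archives: that amounts to something like the following, with the unpack location being an assumption:)

    # Assumed layout: the attached IPC-Run tree was unpacked under ~/perl5lib
    export PERL5LIB=$HOME/perl5lib
    make -C src/test/recovery check    # TAP tests now find IPC::Run via PERL5LIB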

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Duplicate history file?

2021-06-15 Thread Stephen Frost
Greetings,

On Tue, Jun 15, 2021 at 23:21 Julien Rouhaud  wrote:

> On Tue, Jun 15, 2021 at 11:00:57PM -0400, Stephen Frost wrote:
> >
> > As I suggested previously- this is similar to the hooks that we provide.  We
> > don’t extensively document them because if you’re writing an extension
> > which uses a hook, you’re going to be (or should be..) reading the code too.
>
> I disagree, hooks allow developers to implement some new or additional
> behavior which by definition can't be documented.  And it's also relying on the
> C API which by definition allows you to do anything with the server.  There are
> also of course some requirements but they're quite obvious (like a planner_hook
> should return a valid plan and such).


The archive command is technically invoked using the shell, but the
interpretation of the exit code, for example, is only discussed in the C
code, and it’s far from the only consideration that someone developing an
archive command needs to understand.

> On the other hand the archive_command is there to do only one clear thing:
> safely back up a WAL file.  And I don't think that what makes that backup "safe"
> is open to discussion.  Sure, you can choose to ignore some of it if you think
> you can afford to do it, but it doesn't change the fact that it's still a
> requirement which should be documented.


The notion that an archive command can be distanced from backups is really
not reasonable in my opinion.

> > Consider that, really, an archive command should refuse to allow archiving
> > of WAL on a timeline which doesn’t have a corresponding history file in the
> > archive for that timeline (excluding timeline 1).
>
> Yes, that's a clear requirement that should be documented.


Is it a clear requirement that pgbackrest and every other organization that
has developed an archive command has missed? Are you able to point to a
tool which has such a check today?

This is not a trivial problem any more than PG’s use of fsync is trivial
and we clearly should have understood how Linux and fsync work decades ago
and made sure to always crash on any fsync failure and not believe that a
later fsync would return a failure if an earlier one did and the problem
didn’t resolve itself properly.

> > Also, a backup tool
> > should compare the result of pg_start_backup to what’s in the control file,
> > using a fresh read, after start backup returns to make sure that the
> > storage is sane and not, say, cache’ing pages independently (such as might
> > happen with a separate NFS mount..).  Oh, and if a replica is involved, a
> > check should be done to see if the replica has changed timelines and an
> > appropriate message thrown if that happens complaining that the backup was
> > aborted due to the promotion of the replica…
>
> I agree, but unless I'm missing something it's unrelated to what an
> archive_command should be in charge of?


I’m certainly not moved by this argument as it seems to be willfully
missing the point.  Further, if we are going to claim that we must document
archive command to such level then surely we need to also document all the
requirements for pg_start_backup and pg_stop_backup too, so this strikes me
as entirely relevant.

Thanks,

Stephen



Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Noah Misch
On Tue, Jun 15, 2021 at 02:32:11PM -0700, Peter Geoghegan wrote:
> What I had in mind was this: a committer adopting the feature
> themselves. The committer would be morally obligated to maintain the
> feature on an ongoing basis, just as if they were the original
> committer. This seems like the only sensible way of resolving this
> issue once and for all.
> 
> If it really is incredibly important that we keep this feature, or one
> like it, then I have to imagine that somebody will step forward --
> there is still ample opportunity. But if nobody steps forward, I'll be
> forced to conclude that perhaps it wasn't quite as important as I
> first thought.

Hackers are rather wise, but the variety of PostgreSQL use is enormous.  We
see that, among other ways, when regression fixes spike in each vN.1.  The
$SUBJECT feature was born in response to a user experience; a lack of hacker
interest doesn't invalidate that user experience.  We face these competing
interests, at least:

1) Some users want the feature kept so their application can use a certain
   pattern of long-running, snapshot-bearing transactions.

2) (a) Some hackers want the feature gone so they can implement changes
   without making those changes cooperate with this feature.  (b) Bugs in this
   feature make such cooperation materially harder.

3) Some users want the feature gone because (2) is slowing the progress of
   features they do want.

4) Some users want the feature kept because they don't use it but will worry
   what else is vulnerable to removal.  PostgreSQL has infrequent history of
   removing released features.  Normally, PostgreSQL lets some bugs languish
   indefinitely, e.g. in
   https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items#Live_issues

5) Some users want the feature gone because they try it, find a bug, and
   regret trying it or fear trying other features.

A hacker adopting the feature would be aiming to reduce (2)(b) to zero,
essentially.  What other interests are relevant?




Re: Duplicate history file?

2021-06-15 Thread Julien Rouhaud
On Wed, Jun 16, 2021 at 01:10:16PM +0900, Kyotaro Horiguchi wrote:
> 
> I agree with Julien; however, I also want to discuss what to do for
> 14 now.  If we decide not to touch the documentation for that version,
> that discussion would end.  What do you think about that?  I think it's
> impossible to write the full documentation for the requirements *for 14*.

My personal take on that is that this is a bug in the documentation and the
list of requirements should be backported.  Hopefully this can be done before
v14 is released, but if not I don't think that it should be a blocker to make
progress.




Re: Duplicate history file?

2021-06-15 Thread Kyotaro Horiguchi
At Wed, 16 Jun 2021 11:20:55 +0800, Julien Rouhaud  wrote in 
> On Tue, Jun 15, 2021 at 11:00:57PM -0400, Stephen Frost wrote:
> > 
> > As I suggested previously- this is similar to the hooks that we provide. We
> > don’t extensively document them because if you’re writing an extension
> > which uses a hook, you’re going to be (or should be..) reading the code too.
> 
> I disagree, hooks allow developers to implement some new or additional
> behavior which by definition can't be documented.  And it's also relying on the
> C API which by definition allows you to do anything with the server.  There are
> also of course some requirements but they're quite obvious (like a planner_hook
> should return a valid plan and such).
> 
> On the other hand the archive_command is there to do only one clear thing:
> safely back up a WAL file.  And I don't think that what makes that backup "safe"
> is open to discussion.  Sure, you can choose to ignore some of it if you think
> you can afford to do it, but it doesn't change the fact that it's still a
> requirement which should be documented.

I agree with Julien; however, I also want to discuss what to do for
14 now.  If we decide not to touch the documentation for that version,
that discussion would end.  What do you think about that?  I think it's
impossible to write the full documentation for the requirements *for 14*.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Improving isolationtester's data output

2021-06-15 Thread Noah Misch
On Tue, Jun 15, 2021 at 09:43:31PM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2021-06-15 19:26:25 -0400, Tom Lane wrote:
> >> Going forward it wouldn't be a problem, but back-patching isolation
> >> test cases might find it annoying.  On the other hand, my nearby
> >> patch to improve isolation test stability is already going to create
> >> issues of that sort.  (Unless, dare I say it, we back-patch that.)
> 
> > It might be worth to back-patch - aren't there some back branch cases of
> > test instability? And perhaps more importantly, I'm sure we'll encounter
> > cases of writing new isolation tests in the course of fixing bugs that
> > we'd want to backpatch that are hard to make reliable without the new
> > features?
> 
> Yeah, there absolutely is a case to back-patch things like this.  Whether
> it's a strong enough case, I dunno.  I'm probably too close to the patch
> to have an unbiased opinion about that.
> 
> However, a quick look through the commit history finds several places
> where we complained about not being able to back-patch isolation tests to
> before 9.6 because we hadn't back-patched that version's isolationtester
> improvements.  I found 6b802cfc7, 790026972, c88411995, 8b21b416e without
> looking too hard.  So that history certainly suggests that not
> back-patching such test infrastructure is the Wrong Thing.

I'm +1 for back-patching this class of change.  I've wasted time adapting a
back-patch's test case to account for non-back-patched test infrastructure
changes.  Every back-patch of test infrastructure has been a strict win from
my perspective.




RE: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-06-15 Thread houzj.f...@fujitsu.com
On Tuesday, June 15, 2021 10:01 PM Robert Haas  wrote:
> On Tue, Jun 15, 2021 at 7:05 AM Amit Kapila  wrote:
> > Yeah, dealing with partitioned tables is tricky. I think if we don't
> > want to check upfront the parallel safety of all the partitions then
> > the other option as discussed could be to ask the user to specify the
> > parallel safety of partitioned tables.
> 
> Just to be clear here, I don't think it really matters what we *want* to do.
> I don't think it's reasonably *possible* to check all the partitions, because
> we don't hold locks on them. When we're assessing a bunch of stuff related to
> an individual relation, we have a lock on it. I think - though we should
> double-check tablecmds.c - that this is enough to prevent all of the dependent
> objects - triggers, constraints, etc. - from changing. So the stuff we care
> about is stable. But the situation with a partitioned table is different. In
> that case, we can't even examine that stuff without locking all the
> partitions. And even if we do lock all the partitions, the stuff could change
> immediately afterward and we wouldn't know. So I think it would be difficult
> to make it correct.
> 
> Now, maybe it could be done, and I think that's worth a little more thought.
> For example, perhaps whenever we invalidate a relation, we could also somehow
> send some new, special kind of invalidation for its parent saying, essentially,
> "hey, one of your children has changed in a way you might care about." But
> that's not good enough, because it only goes up one level. The grandparent
> would still be unaware that a change it potentially cares about has occurred
> someplace down in the partitioning hierarchy. That seems hard to patch up,
> again because of the locking rules. The child can know the OID of its parent
> without locking the parent, but it can't know the OID of its grandparent
> without locking its parent. Walking up the whole partitioning hierarchy might
> be an issue for a number of reasons, including possible deadlocks, and possible
> race conditions where we don't emit all of the right invalidations in the face
> of concurrent changes. So I don't quite see a way around this part of the
> problem, but I may well be missing something. In fact I hope I am missing
> something, because solving this problem would be really nice.

I think the check for partitions could be even more complicated if we need to
check the parallel safety of partition key expressions. If a user directly
inserts into a partition, then we need to invoke ExecPartitionCheck, which will
execute all of its parents' and grandparents' partition key expressions. It
means that if we change a parent table's partition key expression (by 1)
changing a function in the expression or 2) attaching the parent table as a
partition of another parent table), then we need to invalidate the relcache of
all its children.

BTW, currently, if a user attaches a partitioned table 'A' to be a partition of
another partitioned table 'B', the children of 'A' will not be invalidated.

Best regards,
houzj


Re: Duplicate history file?

2021-06-15 Thread Julien Rouhaud
On Tue, Jun 15, 2021 at 11:00:57PM -0400, Stephen Frost wrote:
> 
> As I suggested previously- this is similar to the hooks that we provide. We
> don’t extensively document them because if you’re writing an extension
> which uses a hook, you’re going to be (or should be..) reading the code too.

I disagree, hooks allow developers to implement some new or additional
behavior which by definition can't be documented.  And it's also relying on the
C API which by definition allows you to do anything with the server.  There are
also of course some requirements but they're quite obvious (like a planner_hook
should return a valid plan and such).

On the other hand the archive_command is there to do only one clear thing:
safely back up a WAL file.  And I don't think that what makes that backup "safe"
is open to discussion.  Sure, you can choose to ignore some of it if you think
you can afford to do it, but it doesn't change the fact that it's still a
requirement which should be documented.

> Consider that, really, an archive command should refuse to allow archiving
> of WAL on a timeline which doesn’t have a corresponding history file in the
> archive for that timeline (excluding timeline 1).

Yes, that's a clear requirement that should be documented.

> Also, a backup tool
> should compare the result of pg_start_backup to what’s in the control file,
> using a fresh read, after start backup returns to make sure that the
> storage is sane and not, say, cache’ing pages independently (such as might
> happen with a separate NFS mount..).  Oh, and if a replica is involved, a
> check should be done to see if the replica has changed timelines and an
> appropriate message thrown if that happens complaining that the backup was
> aborted due to the promotion of the replica…

I agree, but unless I'm missing something it's unrelated to what an
archive_command should be in charge of?




Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 5:00 PM David Rowley  wrote:
> Most of the time when I see that happen it's down to either the
> selectivity of some correlated base-quals being multiplied down to a
> number low enough that we clamp the estimate to be 1 row.   The other
> case is similar, but with join quals.
>
> It seems to me that each time we multiply 2 selectivities together
> that the risk of the resulting selectivity being out increases.  The
> risk is likely lower when we have some extended statistics which
> allows us to do fewer selectivity multiplications.

It seems important to distinguish between risk and uncertainty --
they're rather different things. The short version is that you can
model risk but you cannot model uncertainty. This seems like a problem
of uncertainty to me.

The example from the paper that Robert cited isn't interesting to me
because it hints at a way of managing the uncertainty, exactly; rather, it's
interesting because it seems to emphasize the user's exposure to the
problem, which is what really matters. Even if it was extremely
unlikely that the user would have a problem, the downside of being
wrong is still absurdly high, and the upside of being right is low
(possibly even indistinguishable from zero). It's just not worth
thinking about. Besides, we all know that selectivity estimates are
very often quite wrong without the user ever noticing. It's amazing
that database optimizers work as well as they do, really.

-- 
Peter Geoghegan




Re: Duplicate history file?

2021-06-15 Thread Kyotaro Horiguchi
At Wed, 16 Jun 2021 12:04:03 +0900 (JST), Kyotaro Horiguchi  wrote in 
> Ok, if we follow the direction that we are responsible for ensuring
> that every user has reliable backups, I can't come up with a proper
> description of that.
> 
> We could list several "requirements" like "do a sync after copy", "take a
> checksum for all files then check it periodically" or other things, but
> what is more important to list here, I think, is "how we run the
> archive_command".
> 
> Doesn't the following work for now?
> 
> (No example)
> 
> - "%f is replaced by ... %p is ..., %r is ... in archive_command"
> 
> - We call the archive_command for every WAL segment that is finished.
> 
> - We may call the archive_command for the same file more than once.
> 
> - We may call the archive_command for different files with the same
>   name. In this case the server is working incorrectly and needs a
>   check. Don't overwrite with the new content.
> 
> - We don't offer any durability or integrity guarantees for the archived
>   files. All of that is up to you.  You can use some existing
>   solutions for archiving. See the following links.

Of course, there should also be some description of error handling to go
along with that.
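
(To make the list above concrete, here is a minimal shell sketch of an
archive_command that tries to follow those points.  The archive path and the
use of GNU coreutils sync(1) on file arguments are assumptions, and a
dedicated, well-tested archiving tool remains the better answer, as discussed
elsewhere in this thread.)

#!/bin/sh
# Illustrative sketch only, not a vetted production archive_command.
# Configured as: archive_command = '/usr/local/bin/archive_wal.sh %p %f'
src="$1"                                # %p: path of the finished WAL segment
dst="/mnt/archive/$2"                   # %f: file name; archive location is an assumption

# Never overwrite an existing archive file with different content.
if [ -f "$dst" ]; then
    cmp -s "$src" "$dst" && exit 0      # same file re-archived: report success
    echo "refusing to overwrite $dst with different content" >&2
    exit 1
fi

# Copy under a temporary name, flush it to disk, then rename, so a partly
# written file never appears under the final name.
tmp="$dst.tmp.$$"
cp "$src" "$tmp" || exit 1
sync "$tmp" || exit 1                   # GNU sync(1) accepts file operands
mv "$tmp" "$dst" || exit 1
sync "$(dirname "$dst")" || exit 1      # make the rename durable too
exit 0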

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Duplicate history file?

2021-06-15 Thread Kyotaro Horiguchi
Thanks for the opinions.

At Tue, 15 Jun 2021 11:33:10 -0400, Stephen Frost  wrote in 
> Greetings,
> 
> * Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> > At Fri, 11 Jun 2021 16:08:33 +0900, Michael Paquier  wrote in 
> > > On Fri, Jun 11, 2021 at 03:32:28PM +0900, Kyotaro Horiguchi wrote:
> > > > I think cp can be an example as far as we explain the limitations. (On
> > > > the other hand "test ! -f" cannot, since it actually prevents the server
> > > > from working correctly.)
> > > 
> > > Disagreed.  I think that we should not try to change this area until
> > > we can document a reliable solution, and a simple "cp" is not that.
> > 
> > Isn't removing cp from the documentation a change in this area? I
> > basically agree not to change anything, but the current example
> > "test ! -f  && cp .." and the relevant description have been known to
> > be problematic in certain situations.
> 
> [...]
> 
> > - Write the full (known) requirements and use a pseudo tool-name in
> >  the example?
> 
> I'm generally in favor of just using a pseudo tool-name and then perhaps
> providing a link to a new place on .Org where people can ask to have
> their PG backup solution listed, or something along those lines.

Looks fine.

> >  - provide a minimal implementation of the command?
> 
> Having been down this road for a rather long time, I can't accept this
> as a serious suggestion.  No, not even with Perl.  Been there, done
> that, not going back.
>
> >  - recommend some external tools (that we can guarantee that they
> >conform to the requirements)?
> 
> The requirements are things which are learned over years and changes
> over time.  Trying to document them and keep up with them would be a
> pretty serious project all on its own.  There are external projects who
> spend serious time and energy doing their best to provide the tooling
> needed here and we should be promoting those, not trying to pretend like
> this is a simple thing which anyone could write a short perl script to
> accomplish.

I agree that no simple solution could be really perfect.  The reason I
think that a simple cp can be a candidate for the example might be
based on the assumption that anyone who is going to build a database
system ought to know their requirements, including the
durability/reliability of archives/backups and the limitations of the
adopted methods/technologies.  However, as Julien mentioned, if
there's actually a problem where relatively.. ahem, ill-advised users
(sorry in advance if it's rude) use 'cp' only because it is shown in
the example, without a thought, and inadvertently lose archives, it
might be better that we don't suggest a concrete command for
archive_command.

> >  - not recommend any tools?
> 
> This is the approach that has been tried and it's, objectively, failed
> miserably.  Our users are ending up with invalid and unusable backups,
> corrupted WAL segments, inability to use PITR, and various other issues
> because we've been trying to pretend that this isn't a hard problem.  We
> really need to stop that and accept that it's hard and promote the tools
> which have been explicitly written to address that hard problem.

I can sympathize with that, but is there any difference from system backups?
One can just copy $HOME to another directory on the same drive and then
call it a day. Another uses dd to make an image backup. Others need
durability, integrity guarantees, or even encryption, so they acquire or
purchase a tool that conforms to their requirements.  Or someone creates
their own backup solution that meets their requirements.

On the other hand, which OS distributors offer a long list of
requirements or a recipe for perfect backups? (Yeah, I'm saying this
based on nothing, just from a prejudice.)

If a system is serious, those who don't know enough about backups ought to
consult professionals before building an inadequate backup system and
losing their data.

> > > Hmm.  A simple command that could be used as reference is for example
> > > "dd" that flushes the file by itself, or we could just revisit the
> > > discussions about having a pg_copy command, or we could document a
> > > small utility in perl that does the job.
> > 
> > I think we should do that if pg_copy conforms to the mandatory
> > requirements, but maybe that's for the future. Showing a minimal
> > implementation in perl looks good.
> 
> Already tried doing it in perl.  No, it's not simple and it's also
> entirely vaporware today and implies that we're going to develop this
> tool, improve it in the future as we realize it needs to be improved,
> and maintain it as part of core forever.  If we want to actually adopt
> and pull in a backup tool to be part of core then we should talk about
> things which actually exist, such as the various existing projects that
> have been written to specifically work to address all the requirements
> which are understood today, not say "well, we can just write a simple
> perl script to do it" because it's not actually that 

Re: Duplicate history file?

2021-06-15 Thread Stephen Frost
Greetings,

On Tue, Jun 15, 2021 at 21:11 Julien Rouhaud  wrote:

> On Tue, Jun 15, 2021 at 02:28:04PM -0400, Stephen Frost wrote:
> >
> > * Julien Rouhaud (rjuju...@gmail.com) wrote:
> > > On Tue, Jun 15, 2021 at 11:33:10AM -0400, Stephen Frost wrote:
> > >
> > > The fact that this is such a complex problem is the very reason why we should
> > > spend a lot of energy documenting the various requirements.  Otherwise, how
> > > could anyone implement a valid program for that and how could anyone validate
> > > that a solution claiming to do its job actually does its job?
> >
> > Reading the code.
>
> Oh, if it's as simple as that then surely documenting the various requirements
> won't be an issue.


As I suggested previously- this is similar to the hooks that we provide. We
don’t extensively document them because if you’re writing an extension
which uses a hook, you’re going to be (or should be..) reading the code too.

Consider that, really, an archive command should refuse to allow archiving
of WAL on a timeline which doesn’t have a corresponding history file in the
archive for that timeline (excluding timeline 1). Also, a backup tool
should compare the result of pg_start_backup to what’s in the control file,
using a fresh read, after start backup returns to make sure that the
storage is sane and not, say, cache’ing pages independently (such as might
happen with a separate NFS mount..).  Oh, and if a replica is involved, a
check should be done to see if the replica has changed timelines and an
appropriate message thrown if that happens complaining that the backup was
aborted due to the promotion of the replica…
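
(A rough sketch of that first check, only to make the idea concrete --
ARCHIVE_DIR and the name parsing are assumptions, not code from pgbackrest or
any other existing tool:)

# $1 is the file being archived (%f); WAL segment names are 24 hex digits,
# and the first 8 of them are the timeline ID.
fname="$1"
if [ "${#fname}" -eq 24 ]; then
    tli=$(printf '%s' "$fname" | cut -c1-8)
    if [ "$tli" != "00000001" ] && [ ! -f "$ARCHIVE_DIR/$tli.history" ]; then
        echo "timeline $tli has no history file in the archive; refusing to archive $fname" >&2
        exit 1
    fi
fi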

To be clear- these aren’t checks that pgbackrest has today and I’m not
trying to make it out as if pgbackrest is the only solution and the only
tool that “does everything and is correct” because we aren’t there yet and
I’m not sure we ever will be “all correct” or “done”.

These, however, are ones we have planned to add because of things we’ve
seen and thought of, most of them in just the past few months.

Thanks,

Stephen



Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-06-15 Thread David Christensen


>> I had not really looked at the patch, but if there's a cleanup portion to the same
>> patch as you're adding the YB too, then maybe it's worth separating those out
>> into another patch so that the two can be considered independently.
> 
> I agree with this opinion. It seems to me that we should think about units
> and refactoring separately.
> Sorry for the confusion.

Sure thing, I think that makes sense. Refactor with the existing units and
debate the number of additional units to include. I do think Petabytes and
Exabytes are at least within the realm of ones we should include; I'm less
tied to ZB and YB, which were just included for completeness.
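
(For context, with the units available before this patch anything at petabyte
scale still comes out in TB; a quick illustration, with the output paraphrased
from memory rather than copied from a real session:)

SELECT pg_size_pretty(10 * 1024::numeric ^ 5);   -- 10 PB worth of bytes
-- currently reported as something like "10240 TB"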



RE: [PATCH] expand the units that pg_size_pretty supports on output

2021-06-15 Thread Shinya11.Kato
>I had not really looked at the patch, but if there's a cleanup portion to the same
>patch as you're adding the YB too, then maybe it's worth separating those out
>into another patch so that the two can be considered independently.

I agree with this opinion. It seems to me that we should think about units and 
refactoring separately.
Sorry for the confusion.

Best regards,
Shinya Kato


Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 6:15 PM David Rowley  wrote:
> On Wed, 16 Jun 2021 at 12:11, Peter Geoghegan  wrote:
> > Whether or not we throw the plan back at the planner or "really change
> > our minds at execution time" seems like a distinction without a
> > difference.
>
> What is "really change our minds at execution time"?  Is that switch
> to another plan without consulting the planner?

I don't know what it means. That was my point -- it all seems like
semantics to me.

The strong separation between plan time and execution time isn't
necessarily a good thing, at least as far as solving some of the
thorniest problems goes. It seems obvious to me that cardinality
estimation is the main problem, and that the most promising solutions
are all fundamentally about using execution time information to change
course. Some problems with planning just can't be solved at plan time
-- no model can ever be smart enough. Better to focus on making query
execution more robust, perhaps by totally changing the plan when it is
clearly wrong. But also by using more techniques that we've
traditionally thought of as execution time techniques (e.g. role
reversal in hash join). The distinction is blurry to me.

There are no doubt practical software engineering issues with this --
separation of concerns and whatnot. But it seems premature to go into
that now.

> The new information might cause the join order to
> completely change. It might not be as simple as swapping a Nested Loop
> for a Hash Join.

I agree that it might not be that simple at all. I think that Robert
is saying that this is one case where it really does appear to be that
simple, and so we really can expect to benefit from a simple plan-time
heuristic that works within the confines of the current model. Why
wouldn't we just take that easy win, once the general idea has been
validated some more? Why let the perfect be the enemy of the good?

I have perhaps muddied the waters by wading into the more general
question of robust execution, the inherent uncertainty with
cardinality estimation, and so on. Robert really didn't seem to be
talking about that at all (though it is clearly related).

> > Either way we're changing our minds about the plan based
> > on information that is fundamentally execution time information, not
> > plan time information. Have I missed something?
>
> I don't really see why you think the number of rows that a given join
> might produce is execution information.

If we're 100% sure a join will produce at least n rows because we
executed it (with the express intention of actually doing real query
processing that returns rows to the client), and it already produced n
rows, then what else could it be called? Why isn't it that simple?

> It's exactly the sort of
> information the planner needs to make a good plan. It's just that
> today we get that information from statistics. Plenty of other DBMSs
> make decisions from sampling.

> FWIW, we already have some minimalist
> sampling in get_actual_variable_range().

I know, but that doesn't seem all that related -- it almost seems like
the opposite idea. It isn't the executor balking when it notices that
the plan is visibly wrong during execution, in some important way.
It's more like the planner using the executor to get information about
an index that is well within the scope of what we think of as plan
time.

To some degree the distinction gets really blurred due to nodes like
hash join, where some important individual decisions are delayed until
execution time already. It's really unclear when precisely it stops
being that, and starts being more of a case of either partially or
wholly replanning. I don't know how to talk about it without it being
confusing.

> I'm just trying to highlight that I don't think overloading nodes is a
> good path to go down.  It's not a sustainable practice. It's a path
> towards just having a single node that does everything. If your
> suggestion was not serious then there's no point in discussing it
> further.

As I said, it was a way of framing one particular issue that Robert is
concerned about.

-- 
Peter Geoghegan




Re: Improving isolationtester's data output

2021-06-15 Thread Tom Lane
Andres Freund  writes:
> On 2021-06-15 19:26:25 -0400, Tom Lane wrote:
>> Going forward it wouldn't be a problem, but back-patching isolation
>> test cases might find it annoying.  On the other hand, my nearby
>> patch to improve isolation test stability is already going to create
>> issues of that sort.  (Unless, dare I say it, we back-patch that.)

> It might be worth to back-patch - aren't there some back branch cases of
> test instability? And perhaps more importantly, I'm sure we'll encounter
> cases of writing new isolation tests in the course of fixing bugs that
> we'd want to backpatch that are hard to make reliable without the new
> features?

Yeah, there absolutely is a case to back-patch things like this.  Whether
it's a strong enough case, I dunno.  I'm probably too close to the patch
to have an unbiased opinion about that.

However, a quick look through the commit history finds several places
where we complained about not being able to back-patch isolation tests to
before 9.6 because we hadn't back-patched that version's isolationtester
improvements.  I found 6b802cfc7, 790026972, c88411995, 8b21b416e without
looking too hard.  So that history certainly suggests that not
back-patching such test infrastructure is the Wrong Thing.

(And yeah, the failures we complained of in the other thread are
certainly there in the back branches.  I think the only reason there
seem to be fewer is that the back branches see fewer test runs.)

regards, tom lane




Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Thomas Munro
On Wed, Jun 16, 2021 at 7:17 AM Robert Haas  wrote:
> Progress has been pretty limited, but not altogether nonexistent.
> 55b7e2f4d78d8aa7b4a5eae9a0a810601d03c563 fixed, or at least seemed to
> fix, the time->XID mapping, which is one of the main things that
> Andres said was broken originally. Also, there are patches on this
> thread from Thomas Munro to add some test coverage for that case,
> another problem Andres noted in his original email. I guess it
> wouldn't be too hard to get something committed there, and I'm willing
> to do it if Thomas doesn't want to and if there's any prospect of
> salvaging the feature.

FTR the latest patches are on a different thread[1].  I lost steam on
that stuff because I couldn't find a systematic way to deal with the
lack of checks all over the place, or really understand how the whole
system fits together with confidence.  Those patches to fix an xid
wraparound bug and make the testing work better may be useful and I'll
be happy to rebase them, depending on how this discussion goes, but it
seems a bit like the proverbial deckchairs on the Titanic from what
I'm reading...  I think the technique I showed for exercising some
basic STO mechanisms and scenarios is probably useful, but I currently
have no idea how to prove much of anything about the whole system and
am not personally in a position to dive into that rabbit hole in a
PG15 time scale.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGJyw%3DuJ4eL1x%3D%2BvKm16fLaxNPvKUYtnChnRkSKi024u_A%40mail.gmail.com




Re: Improving isolationtester's data output

2021-06-15 Thread Andres Freund
Hi,

On 2021-06-15 19:26:25 -0400, Tom Lane wrote:
> Going forward it wouldn't be a problem, but back-patching isolation
> test cases might find it annoying.  On the other hand, my nearby
> patch to improve isolation test stability is already going to create
> issues of that sort.  (Unless, dare I say it, we back-patch that.)

It might be worth to back-patch - aren't there some back branch cases of
test instability? And perhaps more importantly, I'm sure we'll encounter
cases of writing new isolation tests in the course of fixing bugs that
we'd want to backpatch that are hard to make reliable without the new
features?

Greetings,

Andres Freund




Re: Improving the isolationtester: fewer failures, less delay

2021-06-15 Thread Tom Lane
Andres Freund  writes:
> Only halfway related: I wonder if we should remove the automatic
> permutation stuff - it's practically never useful. Probably not worth
> changing...

Where it is useful, it saves a lot of error-prone typing ...

> Minor suggestion: I think the following would be a bit easier to read if
> there first were a list of markers, and then separately the longer
> descriptions. Right now it's a bit hard to see which paragraph
> introduces a new type of marker, and which just adds further commentary.

OK, will do.  Will act on your other cosmetic points too, tomorrow or so.

>> +    if (step_has_blocker(pstep))
>> +    {
>> +        if (!(flags & STEP_RETRY))
>> +            printf("step %s: %s \n",
>> +                   step->name, step->sql);
>> +        return true;
>> +    }

> Might be a bug in my mental state machine: Will this work correctly for
> PSB_ONCE, where we'll have already returned before?

This bit ignores PSB_ONCE.  Once we've dropped out of try_complete_step
the first time, PSB_ONCE is done affecting things.  (I'm not in love
with that symbol name, if you have a better idea.)

regards, tom lane




Re: Improving the isolationtester: fewer failures, less delay

2021-06-15 Thread Andres Freund
Hi,

Only halfway related: I wonder if we should remove the automatic
permutation stuff - it's practically never useful. Probably not worth
changing...


On 2021-06-15 17:09:00 -0400, Tom Lane wrote:
> +The general form of a permutation entry is
> +
> + "step_name" [ ( marker [ , marker ... ] ) ]
> +
> +where each marker defines a "blocking condition".  The step will not be
> +reported as completed before all the blocking conditions are satisfied.

Minor suggestion: I think the following would be a bit easier to read if
there first were a list of markers, and then separately the longer
descriptions. Right now it's a bit hard to see which paragraph
introduces a new type of marker, and which just adds further commentary.


> +            /*
> +             * Check for other steps that have finished.  We must do this
> +             * if oldstep completed; while if it did not, we need to poll
> +             * all the active steps in hopes of unblocking oldstep.
> +             */

Somehow I found the second sentence a bit hard to parse - I think it's
the "while ..." sounding a bit odd to me.


> +            /*
> +             * If the target session is still busy, apply a timeout to
> +             * keep from hanging indefinitely, which could happen with
> +             * incorrect blocker annotations.  Use the same 2 *
> +             * max_step_wait limit as try_complete_step does for deciding
> +             * to die.  (We don't bother with trying to cancel anything,
> +             * since it's unclear what to cancel in this case.)
> +             */
> +            if (iconn->active_step != NULL)
> +            {
> +                struct timeval current_time;
> +                int64       td;
> +
> +                gettimeofday(&current_time, NULL);
> +                td = (int64) current_time.tv_sec - (int64) start_time.tv_sec;
> +                td *= USECS_PER_SEC;
> +                td += (int64) current_time.tv_usec - (int64) start_time.tv_usec;
> +                if (td > 2 * max_step_wait)
> +                {
> +                    fprintf(stderr, "step %s timed out after %d seconds\n",
> +                            iconn->active_step->name,
> +                            (int) (td / USECS_PER_SEC));
> +                    exit(1);
> +                }
> +            }
> +        }
>     }

It might be worth printing out the list of steps that are being waited for
when reaching the timeout - it seems like it'd now be easier to end up
with specs that are a bit hard to debug. And one cannot necessarily look at
pg_locks or such anymore to debug them.


> @@ -833,6 +946,19 @@ try_complete_step(TestSpec *testspec, Step *step, int flags)
>   }
>   }
>  
> +    /*
> +     * The step is done, but we won't report it as complete so long as there
> +     * are blockers.
> +     */
> +    if (step_has_blocker(pstep))
> +    {
> +        if (!(flags & STEP_RETRY))
> +            printf("step %s: %s \n",
> +                   step->name, step->sql);
> +        return true;
> +    }

Might be a bug in my mental state machine: Will this work correctly for
PSB_ONCE, where we'll have already returned before?


Greetings,

Andres Freund




Re: disfavoring unparameterized nested loops

2021-06-15 Thread David Rowley
On Wed, 16 Jun 2021 at 12:11, Peter Geoghegan  wrote:
> Whether or not we throw the plan back at the planner or "really change
> our minds at execution time" seems like a distinction without a
> difference.

What is "really change our minds at execution time"?  Is that switch
to another plan without consulting the planner?  If so what decides
what that new plan should be? The planner is meant to be the expert in
that department. The new information might cause the join order to
completely change. It might not be as simple as swapping a Nested Loop
for a Hash Join.

> Either way we're changing our minds about the plan based
> on information that is fundamentally execution time information, not
> plan time information. Have I missed something?

I don't really see why you think the number of rows that a given join
might produce is execution information. It's exactly the sort of
information the planner needs to make a good plan. It's just that
today we get that information from statistics. Plenty of other DBMSs
make decisions from sampling. FWIW, we already have some minimalist
sampling in get_actual_variable_range().

I'm just trying to highlight that I don't think overloading nodes is a
good path to go down.  It's not a sustainable practice. It's a path
towards just having a single node that does everything. If your
suggestion was not serious then there's no point in discussing it
further.

David




Re: Duplicate history file?

2021-06-15 Thread Julien Rouhaud
On Tue, Jun 15, 2021 at 02:28:04PM -0400, Stephen Frost wrote:
> 
> * Julien Rouhaud (rjuju...@gmail.com) wrote:
> > On Tue, Jun 15, 2021 at 11:33:10AM -0400, Stephen Frost wrote:
> > 
> > The fact that this is such a complex problem is the very reason why we should
> > spend a lot of energy documenting the various requirements.  Otherwise, how
> > could anyone implement a valid program for that and how could anyone validate
> > that a solution claiming to do its job actually does its job?
> 
> Reading the code.

Oh, if it's as simple as that then surely documenting the various requirements
won't be an issue.




Re: Different compression methods for FPI

2021-06-15 Thread Michael Paquier
On Tue, Jun 15, 2021 at 11:14:56AM -0500, Justin Pryzby wrote:
> You're right, I hadn't though this through all the way.
> There's precedent if the default is non-static (wal_sync_method).
> 
> But I think yes/on/true/1 should be a compatibility alias for a static thing,
> and then the only option is pglz.
> 
> Of course, changing the default to LZ4 is still a possibility.

We have not reached yet a conclusion with the way we are going to
parameterize all that, so let's adapt depending on the feedback.  For
now, I am really interested in this patch set, so I am going to run
some tests of my own and test more around the compression levels we
have at our disposals with the proposed algos.

What I'd like us to finish with here is one new algorithm method, able
to cover a large range of cases as mentioned upthread, from
low-CPU/low-compression to high-CPU/high-compression.  It does not
seem like a good idea to be stuck with an algo that only specializes
in one or the other, for example.
--
Michael




Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Andres Freund
Hi,

On 2021-06-15 15:17:05 -0400, Robert Haas wrote:
> But that's not clear to me. I'm not clear how exactly how many
> problems we know about and need to fix in order to keep the feature,
> and I'm also not clear how deep the hole goes. Like, if we need to get
> a certain number of specific bugs fixed, I might be willing to do
> that. If we need to commit to a major rewrite of the current
> implementation, that's more than I can do. But I guess I don't
> understand exactly how bad the current problems are. Reviewing
> complaints from Andres from this thread:

One important complaint I think your (useful!) list missed is that there are
missing *read side* checks that demonstrably lead to wrong query results:
https://www.postgresql.org/message-id/CAH2-Wz%3DFQ9rbBKkt1nXvz27kmd4A8i1%2B7dcLTNqpCYibxX83VQ%40mail.gmail.com
and that it's currently very hard to figure out where they need to be, because
there's no real explained model of what needs to be checked and what not.


> > * I am fairly sure that it can cause crashes (or even data corruption),
> >  because it assumes that DML never needs to check for old snapshots
> >  (with no meaningful justificiation). Leading to heap_update/delete to
> >  assume the page header is a tuple.
> 
> I don't understand the issue here, really. I assume there might be a
> wrong word here, because assuming that the page header is a tuple
> doesn't sound like a thing that would actually happen.

I suspect what I was thinking of is that a tuple could get pruned away due to
s_t_o, which would leave an LP_DEAD item around. As heap_update/delete neither
checks s_t_o nor balks at targeting LP_DEAD items, we'd use the offset from
the LP_DEAD item. ItemIdSetDead() sets lp_off to 0 - which would mean that
the page header is interpreted as a tuple. Right?


> I think one of the key problems for this feature is figuring out
> whether you've got snapshot-too-old checks in all the right places. I
> think what is being alleged here is that heap_update() and
> heap_delete() need them, and that it's not good enough to rely on the
> scan that found the tuple to be updated or deleted having already
> performed those checks. It is not clear to me whether that is true, or
> how it could cause crashes.  Andres may have explained this to me at
> some point, but if he did I have unfortunately forgotten.

I don't think it is sufficient to rely on the scan. That works only as long as
the page with the to-be-modified tuple is pinned (since that'd prevent pruning
/ vacuuming from working on the page), but I am fairly sure that there are
plans where the target tuple is not pinned from the point it was scanned until
it is modified. In which case it is entirely possible that the u/d target can
be pruned away due to s_t_o between the scan checking s_t_o and the u/d
executing.


> My general point here is that I would like to know whether we have a
> finite number of reasonably localized bugs or a three-ring disaster
> that is unrecoverable no matter what we do. Andres seems to think it
> is the latter

Correct. I think there's numerous architectural issues the way the feature is
implemented right now, and that it'd be a substantial project to address them.


> For the record, and to Peter's point, I think it's reasonable to set
> v15 feature freeze as a drop-dead date for getting this feature into
> acceptable shape, but I would like to try to nail down what we think
> "acceptable" means in this context.

I think the absolute minimum would be to have
- actually working tests
- a halfway thorough code review of the feature
- added documentation explaining where exactly s_t_o tests need to be
- bugfixes obviously

If I were to work on the feature, I cannot imagine being sufficiently confident
that the feature works as long as the xid->time mapping granularity is a
minute. It's just not possible to write reasonable tests with the granularity
being that high. Or even to do manual tests of it - I'm not that patient. But
I "can accept" it if somebody actually doing the work differs on this.

Greetings,

Andres Freund




Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 5:00 PM David Rowley  wrote:
> I don't really think we should solve this by having nodeNestloop.c
> fall back on hashing when the going gets tough.  Overloading our nodes
> that way is not a sustainable thing to do.  I'd rather see the
> executor throw the plan back at the planner along with some hints
> about what was wrong with it.  We could do that providing we've not
> sent anything back to the client yet.

It wasn't a serious suggestion -- it was just a way of framing the
issue at hand that I thought might be useful.

If we did have something like that (which FWIW I think makes sense but
is hard to do right in a general way) then it might be expected to
preemptively refuse to even start down the road of using an
unparameterized nestloop join very early, or even before execution
time. Such an adaptive join algorithm/node might be expected to have a
huge bias against this particular plan shape, a bias that can be reduced to a
simple heuristic. But you can have the simple heuristic without
needing to build everything else.

Whether or not we throw the plan back at the planner or "really change
our minds at execution time" seems like a distinction without a
difference. Either way we're changing our minds about the plan based
on information that is fundamentally execution time information, not
plan time information. Have I missed something?

-- 
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-15 Thread David Rowley
On Wed, 16 Jun 2021 at 05:09, Robert Haas  wrote:
> How to do that is not very clear. One very simple thing we could do
> would be to introduce enable_nestloop=parameterized or
> enable_parameterized_nestloop=false. That is a pretty blunt instrument
> but the authors of the paper seem to have done that with positive
> results, so maybe it's actually not a dumb idea.

It's not great that people are having to use such blunt instruments to
get the planner to behave.  It might not be a terrible idea to provide
them with something a bit sharper as you suggest. The GUC idea is
certainly something that could be done without too much effort.

There was some talk of doing that in [1].
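
For scale, the blunt-instrument variant really is small; it would amount to
roughly one more entry in guc.c's ConfigureNamesBool[], plus a test where
unparameterized nestloop paths get built.  Everything below is hypothetical
(the name included; Robert floated a couple of spellings upthread) and just
modelled on the existing enable_* entries:

    {
        {"enable_unparameterized_nestloop", PGC_USERSET, QUERY_TUNING_METHOD,
            gettext_noop("Enables the planner's use of unparameterized nested-loop join plans."),
            NULL,
            GUC_EXPLAIN
        },
        &enable_unparameterized_nestloop,
        true,
        NULL, NULL, NULL
    },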

> Another approach
> would be to introduce a large fuzz factor for such nested loops e.g.
> keep them only if the cost estimate is better than the comparable hash
> join plan by at least a factor of N (or if no such plan is being
> generated for some reason).

In my experience, the most common reason that the planner chooses
non-parameterized nested loops wrongly is when there's row
underestimation that says there's just going to be 1 row returned by
some set of joins.  The problem often comes when some subsequent join
is planned and the planner sees the given join rel only produces one
row.  The cheapest join method we have to join 1 row is Nested Loop.
So the planner just sticks the 1-row join rel on the outer side
thinking the executor will only need to scan the inner side of the
join once.  When the outer row count blows up, then we end up scanning
that inner side many more times. The problem is compounded when you
nest it a few joins deep.

Most of the time when I see that happen it's down to either the
selectivity of some correlated base-quals being multiplied down to a
number low enough that we clamp the estimate to be 1 row.   The other
case is similar, but with join quals.

It seems to me that each time we multiply 2 selectivities together
that the risk of the resulting selectivity being out increases.  The
risk is likely lower when we have some extended statistics which
allows us to do fewer selectivity multiplications.
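
As a toy illustration with made-up numbers, loosely mirroring what
clauselist_selectivity() and clamp_row_est() end up doing:

    /* three quals, each estimated at 0.001, treated as independent */
    double  selectivity = 0.001 * 0.001 * 0.001;           /* 1e-9 */
    double  rows = clamp_row_est(1000000 * selectivity);   /* 0.001 -> 1 */

If those quals are in fact correlated, the true count could easily be in
the hundreds or thousands, yet everything above this join gets planned as
though exactly one row will come out.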

For that 1-row case, doing an unparameterized nested loop is only the
cheapest join method by a tiny amount.  It really wouldn't be much
more expensive to just put that single row into a hash table.  If that
1 estimated row turns out to be 10 actual rows then it's likely not
too big a deal for the hash join code to accommodate the 9 additional
rows.

This makes me think that it's insanely risky for the planner to be
picking Nested Loop joins in this case. And that puts me down the path
of thinking that this problem should be solved by having the planner
take into account risk as well as costs when generating plans.

I don't really have any concrete ideas on that, but a basic idea that
I have been considering is that a Path has a risk_factor field that is
decided somewhere like clauselist_selectivity(). Maybe the risk can go
up by 1 each time we multiply an individual selectivity. (As of
master, estimate_num_groups() allows the estimation to pass back some
further information to the caller. I added that for Result Cache so I
could allow the caller to get visibility about when the estimate fell
back on DEFAULT_NUM_DISTINCT. clauselist_selectivity() maybe could get
similar treatment to allow the risk_factor or number of nstats_used to
be passed back).  We'd then add a GUC, something like
planner_risk_adversion which could be set to anything from 0.0 to some
positive number. During add_path() we could do the cost comparison
like:  path1.cost * path1.risk_factor * (1.0 + planner_risk_adversion)
< path2.cost * path2.risk_factor * (1.0 + planner_risk_adversion).
That way, if you set planner_risk_adversion to 0.0, then the planner
does as it does today, i.e. takes big risks.
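
To be clear about the shape of the idea, here's a very rough sketch.
Everything in it (the risk_factor field, the GUC, the helper itself) is
hypothetical, and I've rearranged the formula slightly so that setting the
GUC to 0.0 really does reduce to the plain cost comparison:

    static bool
    path_cheaper_with_risk(Path *path1, Path *path2)
    {
        double  adj1 = path1->total_cost *
            (1.0 + path1->risk_factor * planner_risk_adversion);
        double  adj2 = path2->total_cost *
            (1.0 + path2->risk_factor * planner_risk_adversion);

        /* planner_risk_adversion = 0.0 compares bare total_cost, as today */
        return adj1 < adj2;
    }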

The problem I have with this idea is that I really don't know how to
properly calculate what the risk_factor should be set to.  It seems
easy at first to set it to something that has the planner avoid these
silly 1-row estimate nested loop mistakes, but I think what we'd set
the risk_factor to would become much more important when more and more
Path types start using it. So if we did this and just guessed the
risk_factor, that might be fine when only 1 of the paths being
compared had a non-zero risk_factor, but as soon as both paths have
one set, unless they're set to something sensible, then we just end up
comparing garbage costs to garbage costs.

Another common mistake the planner makes is around WHERE a = <value>
ORDER BY b LIMIT n; where there are separate indexes on (a) and (b).
Scanning the (b) index is pretty risky. All the "a" values you need
might be right at the end of the index. It seems safer to scan the (a)
index as we'd likely have statistics that tell us how many rows exist
with a = <value>. We don't have any stats that tell us where in the (b)
index are all the rows with a = <value>.

I don't really think we should solve this by having nodeNestloop.c

Re: Support for NSS as a libpq TLS backend

2021-06-15 Thread Jacob Champion
On Wed, 2021-06-16 at 00:08 +0200, Daniel Gustafsson wrote:
> > On 15 Jun 2021, at 00:15, Jacob Champion  wrote:
> > Attached is a quick patch; does it work on your machine?
> 
> It does, thanks!  I've included it in the attached v37 along with a few tiny
> non-functional improvements in comment spelling etc.

Great, thanks!

I've been tracking down reference leaks in the client. These open
references prevent NSS from shutting down cleanly, which then makes it
impossible to open a new context in the future. This probably affects
other libpq clients more than it affects psql.

The first step to fixing that is not ignoring failures during NSS
shutdown, so I've tried a patch to pgtls_close() that pushes any
failures through the pqInternalNotice(). That seems to be working well.
The tests were still mostly green, so I taught connect_ok() to fail if
any stderr showed up, and that exposed quite a few failures.

I am currently stuck on one last failing test. This leak seems to only
show up when using TLSv1.2 or below. There doesn't seem to be a
substantial difference in libpq code coverage between 1.2 and 1.3, so
I'm worried that either 1) there's some API we use that "requires"
cleanup, but only on 1.2 and below, or 2) there's some bug in my
version of NSS.

Attached are a few work-in-progress patches. I think the reference
cleanups themselves are probably solid, but the rest of it could use
some feedback. Are there better ways to test for this? and can anyone
reproduce the TLSv1.2 leak?

--Jacob
From 6976d15a4be50276ea2f77fd5c37d238493f9447 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Tue, 15 Jun 2021 10:01:14 -0700
Subject: [PATCH 1/3] nss: don't ignore failures during context shutdown

The biggest culprit of a shutdown failure so far seems to be object
leaks. A failure here may prevent future contexts from being created
(and they'll fail in confusing ways), so don't ignore it.

There's not a great way to signal errors from this layer of the stack,
but a notice will at least give us a chance of seeing the problem.
---
 src/interfaces/libpq/fe-secure-nss.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/interfaces/libpq/fe-secure-nss.c 
b/src/interfaces/libpq/fe-secure-nss.c
index 1b9bd4f1a5..90f57e0d99 100644
--- a/src/interfaces/libpq/fe-secure-nss.c
+++ b/src/interfaces/libpq/fe-secure-nss.c
@@ -139,7 +139,16 @@ pgtls_close(PGconn *conn)
 {
if (conn->nss_context)
{
-   NSS_ShutdownContext(conn->nss_context);
+   SECStatus   status;
+   status = NSS_ShutdownContext(conn->nss_context);
+
+   if (status != SECSuccess)
+   {
+   pqInternalNotice(&conn->noticeHooks,
+        "unable to shut down NSS context: %s",
+        pg_SSLerrmessage(PR_GetError()));
+   }
+
conn->nss_context = NULL;
}
 }
-- 
2.25.1

From acabc4d51d38124db748a6be9dca0ff83693e039 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Tue, 15 Jun 2021 14:37:28 -0700
Subject: [PATCH 2/3] test: check for empty stderr during connect_ok()

...in a similar manner to command_like(), to catch notices-as-errors
coming from NSS.
---
 src/test/authentication/t/001_password.pl | 2 +-
 src/test/authentication/t/002_saslprep.pl | 2 +-
 src/test/kerberos/t/001_auth.pl   | 2 +-
 src/test/ldap/t/001_auth.pl   | 2 +-
 src/test/perl/PostgresNode.pm | 5 -
 src/test/ssl/t/001_ssltests.pl| 4 ++--
 src/test/ssl/t/002_scram.pl   | 4 ++--
 7 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/src/test/authentication/t/001_password.pl 
b/src/test/authentication/t/001_password.pl
index 427a360198..327dfb4ed9 100644
--- a/src/test/authentication/t/001_password.pl
+++ b/src/test/authentication/t/001_password.pl
@@ -20,7 +20,7 @@ if (!$use_unix_sockets)
 }
 else
 {
-   plan tests => 23;
+   plan tests => 32;
 }
 
 
diff --git a/src/test/authentication/t/002_saslprep.pl 
b/src/test/authentication/t/002_saslprep.pl
index f080a0ccba..d6508df8bd 100644
--- a/src/test/authentication/t/002_saslprep.pl
+++ b/src/test/authentication/t/002_saslprep.pl
@@ -17,7 +17,7 @@ if (!$use_unix_sockets)
 }
 else
 {
-   plan tests => 12;
+   plan tests => 20;
 }
 
 # Delete pg_hba.conf from the given node, add a new entry to it
diff --git a/src/test/kerberos/t/001_auth.pl b/src/test/kerberos/t/001_auth.pl
index b5594924ca..62015339a0 100644
--- a/src/test/kerberos/t/001_auth.pl
+++ b/src/test/kerberos/t/001_auth.pl
@@ -23,7 +23,7 @@ use Time::HiRes qw(usleep);
 
 if ($ENV{with_gssapi} eq 'yes')
 {
-   plan tests => 44;
+   plan tests => 54;
 }
 else
 {
diff --git a/src/test/ldap/t/001_auth.pl b/src/test/ldap/t/001_auth.pl
index 0ae14e4c85..e857c26258 100644
--- a/src/test/ldap/t/001_auth.pl
+++ b/src/test/ldap/t/001_auth.pl
@@ -9,7 

Re: Improving isolationtester's data output

2021-06-15 Thread Tom Lane
Alvaro Herrera  writes:
> On 2021-Jun-15, Tom Lane wrote:
>> If we wanted to buy into that, I'd think about discarding this
>> ad-hoc code altogether in favor of using one of libpq's fe-print.c
>> routines.  But I'm not really sure that the small legibility gains
>> that would result are worth massive changes in the output files.

> Shrug -- it's a one time change.  It wouldn't bother me, for one.

Going forward it wouldn't be a problem, but back-patching isolation
test cases might find it annoying.  On the other hand, my nearby
patch to improve isolation test stability is already going to create
issues of that sort.  (Unless, dare I say it, we back-patch that.)

I do find it a bit attractive to create some regression-testing
coverage of fe-print.c.  We are never going to remove that code,
AFAICS, so getting some benefit from it would be nice.

regards, tom lane




Re: Improving isolationtester's data output

2021-06-15 Thread Alvaro Herrera
On 2021-Jun-15, Tom Lane wrote:

> I've been spending a lot of time looking at isolationtester results
> over the past couple of days, and gotten really annoyed at how poorly
> it formats query results.  In particular, any column heading or value
> that is 15 characters or longer is not separated from the next column,
> rendering the output quite confusing.

Yeah, I noticed this too.

> Attached is a little hack that tries to improve that case while making
> minimal changes to the output files otherwise.

Seems pretty reasonable.

> There's still a good deal to be desired here: notably, the code still
> does nothing to ensure vertical alignment of successive lines when
> there are wide headings or values.  But doing anything about that
> would involve much-more-invasive changes of the output files.
> If we wanted to buy into that, I'd think about discarding this
> ad-hoc code altogether in favor of using one of libpq's fe-print.c
> routines.  But I'm not really sure that the small legibility gains
> that would result are worth massive changes in the output files.

Shrug -- it's a one time change.  It wouldn't bother me, for one.

-- 
Álvaro Herrera   Valdivia, Chile
"Hay que recordar que la existencia en el cosmos, y particularmente la
elaboración de civilizaciones dentro de él no son, por desgracia,
nada idílicas" (Ijon Tichy)




Improving isolationtester's data output

2021-06-15 Thread Tom Lane
I've been spending a lot of time looking at isolationtester results
over the past couple of days, and gotten really annoyed at how poorly
it formats query results.  In particular, any column heading or value
that is 15 characters or longer is not separated from the next column,
rendering the output quite confusing.

Attached is a little hack that tries to improve that case while making
minimal changes to the output files otherwise.

There's still a good deal to be desired here: notably, the code still
does nothing to ensure vertical alignment of successive lines when
there are wide headings or values.  But doing anything about that
would involve much-more-invasive changes of the output files.
If we wanted to buy into that, I'd think about discarding this
ad-hoc code altogether in favor of using one of libpq's fe-print.c
routines.  But I'm not really sure that the small legibility gains
that would result are worth massive changes in the output files.

Thoughts?

regards, tom lane

diff --git a/src/test/isolation/expected/detach-partition-concurrently-3.out b/src/test/isolation/expected/detach-partition-concurrently-3.out
index 96ee090d53..3f553711b3 100644
--- a/src/test/isolation/expected/detach-partition-concurrently-3.out
+++ b/src/test/isolation/expected/detach-partition-concurrently-3.out
@@ -9,7 +9,7 @@ a
 1  
 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; 
 step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid;
-pg_cancel_backendpg_sleep   
+pg_cancel_backend pg_sleep   
 
 t 
 step s2detach: <... completed>
@@ -35,7 +35,7 @@ a
 1  
 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; 
 step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid;
-pg_cancel_backendpg_sleep   
+pg_cancel_backend pg_sleep   
 
 t 
 step s2detach: <... completed>
@@ -54,7 +54,7 @@ a
 1  
 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; 
 step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid;
-pg_cancel_backendpg_sleep   
+pg_cancel_backend pg_sleep   
 
 t 
 step s2detach: <... completed>
@@ -77,7 +77,7 @@ a
 1  
 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; 
 step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid;
-pg_cancel_backendpg_sleep   
+pg_cancel_backend pg_sleep   
 
 t 
 step s2detach: <... completed>
@@ -95,7 +95,7 @@ a
 1  
 step s2detach2: ALTER TABLE d3_listp DETACH PARTITION d3_listp2 CONCURRENTLY; 
 step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid;
-pg_cancel_backendpg_sleep   
+pg_cancel_backend pg_sleep   
 
 t 
 step s2detach2: <... completed>
@@ -121,7 +121,7 @@ a
 1  
 step s2detach2: ALTER TABLE d3_listp DETACH PARTITION d3_listp2 CONCURRENTLY; 
 step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid;
-pg_cancel_backendpg_sleep   
+pg_cancel_backend pg_sleep   
 
 t 
 step s2detach2: <... completed>
@@ -150,7 +150,7 @@ a
 1  
 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; 
 step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid;
-pg_cancel_backendpg_sleep   
+pg_cancel_backend pg_sleep   
 
 t 
 step s2detach: <... completed>
@@ -172,7 +172,7 @@ a
 1  
 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; 
 step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid;
-pg_cancel_backendpg_sleep   
+pg_cancel_backend pg_sleep   
 
 t 
 step s2detach: <... completed>
@@ -194,7 +194,7 @@ a
 1  
 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; 
 step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid;
-pg_cancel_backendpg_sleep   
+pg_cancel_backend pg_sleep   
 
 t 
 step s2detach: <... completed>
@@ -213,7 +213,7 @@ a
 1  
 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; 
 step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid;
-pg_cancel_backendpg_sleep   
+pg_cancel_backend pg_sleep   
 
 t 
 step s2detach: <... completed>
@@ -233,7 +233,7 @@ a
 1  
 step s2detach: ALTER TABLE d3_listp DETACH PARTITION d3_listp1 CONCURRENTLY; 
 step s1cancel: SELECT pg_cancel_backend(pid), pg_sleep(0.1) FROM d3_pid;
-pg_cancel_backendpg_sleep   
+pg_cancel_backend pg_sleep   
 
 t 
 step s2detach: <... completed>
@@ -252,7 +252,7 @@ a
 1  
 step s2detach: ALTER TABLE 

Re: Error on pgbench logs

2021-06-15 Thread Michael Paquier
On Tue, Jun 15, 2021 at 09:53:06PM +0900, Yugo NAGATA wrote:
> I am fine with this version, but I think it would be better if we have a 
> comment
> explaining what "tx" is for.
> 
> Also, how about adding Assert(tx) instead of using "else if (tx)" because
> we are assuming that tx is always true when agg_interval is not used, right?

Agreed on both points.  From what I can see, this code could be clarified
much more, and perhaps partially refactored to have less spaghetti
code shared between the call at the end of a thread and the one that
gathers stats for a transaction mid-run, but that's not something to
do post-beta1.  I am not completely sure that the result would be
worth it either.

Let's document things and let the readers know the assumptions this
area of the code relies on, for clarity.  The dependency between
agg_interval and sample_rate is one of those things; right now somebody
has to read the option parsing to see why only one of them is possible
at a time.  Using an extra tx flag to track what to do after the loop
for the aggregate print to the log file is an improvement in this
direction.
--
Michael


signature.asc
Description: PGP signature


Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Tom Lane
Peter Geoghegan  writes:
> What I had in mind was this: a committer adopting the feature
> themselves. The committer would be morally obligated to maintain the
> feature on an ongoing basis, just as if they were the original
> committer. This seems like the only sensible way of resolving this
> issue once and for all.

Yeah, it seems clear that we need somebody to do that, given that
Kevin Grittner has been inactive for awhile.  Even if the known
problems can be resolved by drive-by patches, I think this area
needs an ongoing commitment from someone.

regards, tom lane




Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Peter Geoghegan
On Wed, Apr 1, 2020 at 4:59 PM Andres Freund  wrote:
> The primary issue here is that there is no TestForOldSnapshot() in
> heap_hot_search_buffer(). Therefore index fetches will never even try to
> detect that tuples it needs actually have already been pruned away.

This is still true, right? Nobody fixed this bug after 14 months? Even
though we're talking about returning rows that are not visible to the
xact's snapshot?
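
For the archives, a minimal sketch of what the missing check might look
like.  This is illustrative only, not a reviewed fix, though
heap_hot_search_buffer() does already have the snapshot, relation and
buffer in hand:

    /* early in heap_hot_search_buffer(), once the page is available */
    Page    page = BufferGetPage(buffer);

    /*
     * Hypothetical: without something along these lines, an index fetch
     * can land on a page whose tuple versions were already pruned away
     * under old_snapshot_threshold and silently return a wrong answer
     * instead of raising "snapshot too old".
     */
    TestForOldSnapshot(snapshot, relation, page);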

--
Peter Geoghegan




Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 12:17 PM Robert Haas  wrote:
> My general point here is that I would like to know whether we have a
> finite number of reasonably localized bugs or a three-ring disaster
> that is unrecoverable no matter what we do. Andres seems to think it
> is the latter, and I *think* Peter Geoghegan agrees, but I think that
> the point might be worth a little more discussion. I'm unclear whether
> Tom's dislike for the feature represents hostility to the concept -
> with which I would have to disagree - or a judgement on the quality of
> the implementation - which might be justified. For the record, and to
> Peter's point, I think it's reasonable to set v15 feature freeze as a
> drop-dead date for getting this feature into acceptable shape, but I
> would like to try to nail down what we think "acceptable" means in
> this context.

What I had in mind was this: a committer adopting the feature
themselves. The committer would be morally obligated to maintain the
feature on an ongoing basis, just as if they were the original
committer. This seems like the only sensible way of resolving this
issue once and for all.

If it really is incredibly important that we keep this feature, or one
like it, then I have to imagine that somebody will step forward --
there is still ample opportunity. But if nobody steps forward, I'll be
forced to conclude that perhaps it wasn't quite as important as I
first thought. Anybody can agree that it's important in an abstract
sense -- that's easy. What we need is a committer willing to sign on
the dotted line, which we're no closer to today than we were a year
ago. Actions speak louder than words.

-- 
Peter Geoghegan




Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 12:49 PM Tom Lane  wrote:
> Robert Haas  writes:
> > My general point here is that I would like to know whether we have a
> > finite number of reasonably localized bugs or a three-ring disaster
> > that is unrecoverable no matter what we do. Andres seems to think it
> > is the latter, and I *think* Peter Geoghegan agrees, but I think that
> > the point might be worth a little more discussion.
>
> TBH, I am not clear on that either.

I don't know for sure which it is, but that in itself isn't actually
what matters to me. The most concerning thing is that I don't really
know how to *assess* the design now. The clear presence of at least
several very severe bugs doesn't necessarily prove anything (it just
*hints* at major design problems).

If I could make a very clear definitive statement on this then I'd
probably have to do ~1/3 of the total required work -- that'd be my
guess. If it was easy to be quite sure here then we wouldn't still be
here 12 months later. In any case I don't think that the feature
deserves to be treated all that differently to something that was
committed much more recently, given what we know. Frankly it took me
about 5 minutes to find a very serious bug in the feature, pretty much
without giving it any thought. That is not a good sign.

> I think it's a klugy, unprincipled solution to a valid real-world
> problem.  I suspect the implementation issues are not unrelated to
> the kluginess of the concept.  Thus, I would really like to see us
> throw this away and find something better.  I admit I have nothing
> to offer about what a better solution to the problem would look like.
> But I would really like it to not involve random-seeming query failures.

I would be very happy to see somebody take this up, because it is
important. The reality is that anybody that undertakes this task
should start with the assumption that they're starting from scratch,
at least until they learn otherwise. So ISTM that it might as well be
true that it needs a total rewrite, even if it turns out to not be
strictly true.

--
Peter Geoghegan




Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 12:31 PM Robert Haas  wrote:
> Yes, I think it is. Reading the paper really helped me crystallize my
> thoughts about this, because when I've studied the problems myself, I
> came, as you postulate here, to the conclusion that there's a lot of
> stuff the planner does where there is risk and uncertainty, and thus
> that a general framework would be necessary to deal with it.

It is an example (perhaps the only example in the optimizer) of an
oasis of certainty in an ocean of uncertainty. As uncertain as
everything is, we seemingly can make strong robust statements about
the relative merits of each strategy *in general*, just in this
particular instance. It's just not reasonable to make such a reckless
choice, no matter what your general risk tolerance is.

Goetz Graefe is interviewed here, and goes into his philosophy on
robustness -- it seems really interesting to me:

https://sigmodrecord.org/publications/sigmodRecord/2009/pdfs/05_Profiles_Graefe.pdf

> In defense of that approach, note that this is a
> case where we know both that the Nested Loop is risky and that Hash
> Join is a similar alternative with probably similar cost. I am not
> sure there are any other cases where we can say quite so generally
> both that a certain thing is risky and what we could do instead.

I tend to think of a hash join as like a nested loop join with an
inner index scan where you build the index yourself, dynamically. That
might be why I find it easy to make this mental leap. In theory you
could do this by giving the nestloop join runtime smarts -- make it
turn into a hash join adaptively. Like Graefe's G-Join design. That
way you could do this in a theoretically pure way.

I don't think that that's actually necessary just to deal with this
case -- it probably really is as simple as it seems. I point this out
because perhaps it's useful to have that theoretical anchoring.

-- 
Peter Geoghegan




Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Tom Lane
Robert Haas  writes:
> My general point here is that I would like to know whether we have a
> finite number of reasonably localized bugs or a three-ring disaster
> that is unrecoverable no matter what we do. Andres seems to think it
> is the latter, and I *think* Peter Geoghegan agrees, but I think that
> the point might be worth a little more discussion.

TBH, I am not clear on that either.

> I'm unclear whether
> Tom's dislike for the feature represents hostility to the concept -
> with which I would have to disagree - or a judgement on the quality of
> the implementation - which might be justified.

I think it's a klugy, unprincipled solution to a valid real-world
problem.  I suspect the implementation issues are not unrelated to
the kluginess of the concept.  Thus, I would really like to see us
throw this away and find something better.  I admit I have nothing
to offer about what a better solution to the problem would look like.
But I would really like it to not involve random-seeming query failures.

regards, tom lane




Re: [PATCH] Fix buffer not null terminated on (ecpg lib)

2021-06-15 Thread Ranier Vilela
On Tue, Jun 15, 2021 at 3:48 PM Andres Freund 
wrote:

> Hi,
>
> On 2021-06-15 13:53:08 -0400, Tom Lane wrote:
> > Andres Freund  writes:
> > > On 2021-06-15 07:40:46 -0300, Ranier Vilela wrote:
> > >> memcpy would not suffer from it?
> >
> > > It'd not be correct for short sqlstates - you'd read beyond the end of
> > > the source buffer. There are cases of it in the ecpg code.
> >
> > What's a "short SQLSTATE"?  They're all five characters by definition.
>
> I thought there were places that just dealt with "00" etc. And there are -
> but
> it's just comparisons.
>
> I still don't fully feel comfortable just using memcpy() though, given that
> the sqlstates originate remotely / from libpq, making it hard to rely on
> the
> fact that the buffer "ought to" always be at least 5 bytes long? As far as
> I
> can tell there's no enforcement of PQresultErrorField(...,
> PG_DIAG_SQLSTATE)
> being that long.
>
And replacing it with snprintf, what do you guys think?

n = snprintf(sqlca->sqlstate, sizeof(sqlca->sqlstate), "%s", sqlstate);
Assert(n >= 0 && n < sizeof(sqlca->sqlstate));
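
For comparison, a bounded copy that neither reads past a short source nor
truncates a full five-character SQLSTATE could look roughly like the
sketch below.  set_sqlca_sqlstate is a made-up helper name, and padding
with blanks is just one possible policy; sqlca->sqlstate is a fixed
five-byte field with no NUL terminator by design, so nothing here tries
to add one:

    static void
    set_sqlca_sqlstate(struct sqlca_t *sqlca, const char *sqlstate)
    {
        size_t  i = 0;

        /* copy at most sizeof(sqlca->sqlstate) == 5 bytes from the source */
        while (i < sizeof(sqlca->sqlstate) && sqlstate && sqlstate[i] != '\0')
        {
            sqlca->sqlstate[i] = sqlstate[i];
            i++;
        }
        /* pad if the server sent something unexpectedly short */
        while (i < sizeof(sqlca->sqlstate))
            sqlca->sqlstate[i++] = ' ';
    }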

regards,
Ranier Vilela


fix_buffer_null_not_terminated.patch
Description: Binary data


Re: disfavoring unparameterized nested loops

2021-06-15 Thread Robert Haas
On Tue, Jun 15, 2021 at 2:04 PM Peter Geoghegan  wrote:
> I guess that there is a hesitation to not introduce heuristics like
> this because it doesn't fit into some larger framework that captures
> risk -- it might be seen as an ugly special case. But isn't this
> already actually kind of special, whether or not we officially think
> so?

Yes, I think it is. Reading the paper really helped me crystallize my
thoughts about this, because when I've studied the problems myself, I
came, as you postulate here, to the conclusion that there's a lot of
stuff the planner does where there is risk and uncertainty, and thus
that a general framework would be necessary to deal with it. But the
fact that an academic researcher called this problem out as the only
one worth treating specially makes me think that perhaps it deserves
special handling. In defense of that approach, note that this is a
case where we know both that the Nested Loop is risky and that Hash
Join is a similar alternative with probably similar cost. I am not
sure there are any other cases where we can say quite so generally
both that a certain thing is risky and what we could do instead.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Improving the isolationtester: fewer failures, less delay

2021-06-15 Thread Andrew Dunstan


On 6/14/21 10:57 PM, Tom Lane wrote:
> This is a followup to the conversation at [1], in which we speculated
> about constraining the isolationtester's behavior by annotating the
> specfiles, in order to eliminate common buildfarm failures such as [2]
> and reduce the need to use long delays to stabilize the test results.
>
> I've spent a couple days hacking on this idea, and I think it has worked
> out really well.  On my machine, the time needed for "make installcheck"
> in src/test/isolation drops from ~93 seconds to ~26 seconds, as a result
> of removing all the multiple-second delays we used before.  Also,
> while I'm not fool enough to claim that this will reduce the rate of
> bogus failures to zero, I do think it addresses all the repeating
> failures we've seen lately.
>
> In the credit-where-credit-is-due department, this owes some inspiration
> to the patch Asim Praveen offered at [3], though this takes the idea a
> good bit further.
>
> This is still WIP to some extent, as I've not spent much time looking at
> specfiles other than the ones with big delays; there may be additional
> improvements possible in some places.  Also, I've not worried about
> whether the tests pass in serializable mode, since we have problems there
> already [4].  But this seemed like a good point at which to solicit
> feedback and see what the cfbot thinks of it.
>
>   


Cool stuff. Minor gripe while we're on $subject - I wish we'd rename it.
It's long outgrown the original purpose that gave it its name, and
keeping the name makes it unnecessarily obscure. Yes, I know Lisp still
has car and cdr, but we don't need to follow that example.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Robert Haas
On Tue, Jun 15, 2021 at 12:51 PM Tom Lane  wrote:
> So, it's well over a year later, and so far as I can see exactly
> nothing has been done about snapshot_too_old's problems.

Progress has been pretty limited, but not altogether nonexistent.
55b7e2f4d78d8aa7b4a5eae9a0a810601d03c563 fixed, or at least seemed to
fix, the time->XID mapping, which is one of the main things that
Andres said was broken originally. Also, there are patches on this
thread from Thomas Munro to add some test coverage for that case,
another problem Andres noted in his original email. I guess it
wouldn't be too hard to get something committed there, and I'm willing
to do it if Thomas doesn't want to and if there's any prospect of
salvaging the feature.

But that's not clear to me. I'm not clear exactly how many
problems we know about and need to fix in order to keep the feature,
and I'm also not clear how deep the hole goes. Like, if we need to get
a certain number of specific bugs fixed, I might be willing to do
that. If we need to commit to a major rewrite of the current
implementation, that's more than I can do. But I guess I don't
understand exactly how bad the current problems are. Reviewing
complaints from Andres from this thread:

> Looking at TransactionIdLimitedForOldSnapshots() I think the ts ==
> update_ts threshold actually needs to be ts >= update_ts, right now we
> don't handle being newer than the newest bin correctly afaict (mitigated
> by autovacuum=on with naptime=1s doing a snapshot more often). It's hard
> to say, because there's no comments.

This seems specific enough to be analyzed and anything that is broken
can be fixed.

> The whole lock nesting is very hazardous. Most (all?)
> TestForOldSnapshot() calls happen with locks on on buffers held, and can
> acquire lwlocks itself. In some older branches we do entire *catalog
> searches* with the buffer lwlock held (for RelationHasUnloggedIndex()).

I think it's unclear whether there are live problems in master in this area.

> GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
> basis to detect conflicts seems dangerous to me. Isn't that ignoring
> inserts that are already in progress?

Discussion on this point trailed off. Upon rereading, I think Andres
is correct that there's an issue; the snapshot's LSN needs to be set
to a value not older than the last xlog insertion that has been
completed rather than, as now, the last one that is started. I guess
to get that value we would need to do something like
WaitXLogInsertionsToFinish(), or some approximation of it, e.g.
GetXLogWriteRecPtr(), at the risk of unnecessary snapshot-too-old
errors.
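
The minimal form of that would be a one-line change in GetSnapshotData(),
sketched here only to show the direction (not a reviewed fix, and which
function to call is exactly the open question):

    /*
     * Base the limit on a position up to which WAL insertions are known
     * to have completed.  The write pointer is a readily available,
     * conservative approximation; being older than strictly necessary can
     * only produce spurious "snapshot too old" errors, not wrong answers.
     */
    snapshot->lsn = GetXLogWriteRecPtr();   /* today: GetXLogInsertRecPtr() */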

> * In read-mostly workloads it can trigger errors in sessions that are
>  much younger than old_snapshot_threshold, if the transactionid is not
>  advancing.
>
>  I've not tried to reproduce, but I suspect this can also cause wrong
>  query results. Because a later snapshot can have the same xmin as
>  older transactions, it sure looks like we can end up with data from an
>  older xmin getting removed, but the newer snapshot's whenTaken will
>  prevent TestForOldSnapshot_impl() from raising an error.

I haven't really wrapped my head around this one, but it seems
amenable to a localized fix. It basically amounts to a complaint that
GetOldSnapshotThresholdTimestamp() is returning a newer value than it
should. I don't know exactly what's required to make it not do that,
but it doesn't seem intractable.

> * I am fairly sure that it can cause crashes (or even data corruption),
>  because it assumes that DML never needs to check for old snapshots
>  (with no meaningful justification). Leading to heap_update/delete to
>  assume the page header is a tuple.

I don't understand the issue here, really. I assume there might be a
wrong word here, because assuming that the page header is a tuple
doesn't sound like a thing that would actually happen. I think one of
the key problems for this feature is figuring out whether you've got
snapshot-too-old checks in all the right places. I think what is being
alleged here is that heap_update() and heap_delete() need them, and
that it's not good enough to rely on the scan that found the tuple to
be updated or deleted having already performed those checks. It is not
clear to me whether that is true, or how it could cause crashes.
Andres may have explained this to me at some point, but if he did I
have unfortunately forgotten.

My general point here is that I would like to know whether we have a
finite number of reasonably localized bugs or a three-ring disaster
that is unrecoverable no matter what we do. Andres seems to think it
is the latter, and I *think* Peter Geoghegan agrees, but I think that
the point might be worth a little more discussion. I'm unclear whether
Tom's dislike for the feature represents hostility to the concept -
with which I would have to disagree - or a judgement on the quality of
the implementation - which might be justified. For the record, and to

Re: Improving the isolationtester: fewer failures, less delay

2021-06-15 Thread Tom Lane
Andres Freund  writes:
> On 2021-06-14 22:57:08 -0400, Tom Lane wrote:
>> This is still WIP to some extent, as I've not spent much time looking at
>> specfiles other than the ones with big delays; there may be additional
>> improvements possible in some places.  Also, I've not worried about
>> whether the tests pass in serializable mode, since we have problems there
>> already [4].  But this seemed like a good point at which to solicit
>> feedback and see what the cfbot thinks of it.

> Are there spec output changes / new failures, if you apply the patch,
> but do not apply the changes to the spec files?

If you make only the code changes, there are a bunch of diffs stemming
from the removal of the 'error in steps' message prefix.  If you just
mechanically remove those from the .out files without touching the .spec
files, most tests pass, but I don't recall whether that's 100% the case.

> Will look at the patch itself in a bit.

I'll have a v2 in a little bit --- the cfbot pointed out that there
were some contrib tests I'd missed fixing, and I found a couple of
other improvements.

regards, tom lane




Re: Improving the isolationtester: fewer failures, less delay

2021-06-15 Thread Andres Freund
Hi,

On 2021-06-14 22:57:08 -0400, Tom Lane wrote:
> This is a followup to the conversation at [1], in which we speculated
> about constraining the isolationtester's behavior by annotating the
> specfiles, in order to eliminate common buildfarm failures such as [2]
> and reduce the need to use long delays to stabilize the test results.
> 
> I've spent a couple days hacking on this idea, and I think it has worked
> out really well.  On my machine, the time needed for "make installcheck"
> in src/test/isolation drops from ~93 seconds to ~26 seconds, as a result
> of removing all the multiple-second delays we used before.

Very cool stuff. All the reliability things aside, isolationtester
frequently is the slowest test in a parallel check world...


> Also, while I'm not fool enough to claim that this will reduce the
> rate of bogus failures to zero, I do think it addresses all the
> repeating failures we've seen lately.

And it should make it easier to fix some others and also to make it
easier to write some tests that were too hard to get to reliable today.


> This is still WIP to some extent, as I've not spent much time looking at
> specfiles other than the ones with big delays; there may be additional
> improvements possible in some places.  Also, I've not worried about
> whether the tests pass in serializable mode, since we have problems there
> already [4].  But this seemed like a good point at which to solicit
> feedback and see what the cfbot thinks of it.

Are there spec output changes / new failures, if you apply the patch,
but do not apply the changes to the spec files?


Will look at the patch itself in a bit.

Greetings,

Andres Freund




Re: [PATCH] Fix buffer not null terminated on (ecpg lib)

2021-06-15 Thread Andres Freund
Hi,

On 2021-06-15 13:53:08 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2021-06-15 07:40:46 -0300, Ranier Vilela wrote:
> >> memcpy would not suffer from it?
> 
> > It'd not be correct for short sqlstates - you'd read beyond the end of
> > the source buffer. There are cases of it in the ecpg code.
> 
> What's a "short SQLSTATE"?  They're all five characters by definition.

I thought there were places that just dealt with "00" etc. And there are - but
it's just comparisons.

I still don't fully feel comfortable just using memcpy() though, given that
the sqlstates originate remotely / from libpq, making it hard to rely on the
fact that the buffer "ought to" always be at least 5 bytes long? As far as I
can tell there's no enforcement of PQresultErrorField(..., PG_DIAG_SQLSTATE)
being that long.

Greetings,

Andres Freund




Re: unnesting multirange data types

2021-06-15 Thread Andrew Dunstan


On 6/15/21 1:52 PM, Tom Lane wrote:
> Alvaro Herrera  writes:
>> On 2021-Jun-15, Tom Lane wrote:
>>> I think this ought to be reverted and reviewed more carefully.
>> It seems to me that removing the cast-to-range[] is a sufficient fix,
>> and that we can do with only the unnest part for pg14; the casts can be
>> added in 15 (if at all).  That would mean reverting only half the patch.
> Might be a reasonable solution.  But right now I'm annoyed that the
> buildfarm is broken, and I'm also convinced that this didn't get
> adequate testing.  I think "revert and reconsider" is the way
> forward for today.
>
>   



(RMT hat on) That would be my inclination at this stage. The commit
message states that it's trivial, but it seems not to be, and I suspect
it should not have been done at this stage of the development cycle.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: unnesting multirange data types

2021-06-15 Thread Alexander Korotkov
On Tue, Jun 15, 2021 at 8:18 PM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > I did run "check-world", but it passed for me.  Probably the same
> > reason it passed for some buildfarm animals...
>
> The only buildfarm animals that have passed since this went in
> are the ones that don't run the pg_dump or pg_upgrade tests.
>
> It looks to me like the proximate problem is that you should
> have taught pg_dump to skip these new auto-generated functions.
> However, I fail to see why we need auto-generated functions
> for this at all.  Couldn't we have done it with one polymorphic
> function?
>
> I think this ought to be reverted and reviewed more carefully.

Thank you for your feedback.  I've reverted the patch.

I'm going to have closer look at this tomorrow.

--
Regards,
Alexander Korotkov




Re: unnesting multirange data types

2021-06-15 Thread Andrew Dunstan


On 6/15/21 12:06 PM, Alexander Korotkov wrote:
> On Tue, Jun 15, 2021 at 4:49 PM Tom Lane  wrote:
>> Alexander Korotkov  writes:
>>> Pushed!  Thanks to thread participants for raising this topic and
>>> review.  I'll be around to resolve issues if any.
>> Buildfarm is pretty thoroughly unhappy.  Did you do a "check-world"
>> before pushing?
> Yes, I'm looking at this now.
>
> I did run "check-world", but it passed for me.  Probably the same
> reason it passed for some buildfarm animals...
>

Did you configure with --enable-tap-tests? If not, then `make
check-world` won't run the tests that are failing here.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Duplicate history file?

2021-06-15 Thread Stephen Frost
Greetings,

* Julien Rouhaud (rjuju...@gmail.com) wrote:
> On Tue, Jun 15, 2021 at 11:33:10AM -0400, Stephen Frost wrote:
> > The requirements are things which are learned over years and change
> > over time.  Trying to document them and keep up with them would be a
> > pretty serious project all on its own.  There are external projects who
> > spend serious time and energy doing their best to provide the tooling
> > needed here and we should be promoting those, not trying to pretend like
> > this is a simple thing which anyone could write a short perl script to
> > accomplish.
> 
> The fact that this is such a complex problem is the very reason why we should
> spend a lot of energy documenting the various requirements.  Otherwise, how
> could anyone implement a valid program for that and how could anyone validate
> that a solution claiming to do its job actually does its job?

Reading the code.

> > Already tried doing it in perl.  No, it's not simple and it's also
> > entirely vaporware today and implies that we're going to develop this
> > tool, improve it in the future as we realize it needs to be improved,
> > and maintain it as part of core forever.  If we want to actually adopt
> > and pull in a backup tool to be part of core then we should talk about
> > things which actually exist, such as the various existing projects that
> > have been written to specifically work to address all the requirements
> > which are understood today, not say "well, we can just write a simple
> > perl script to do it" because it's not actually that simple.
> 
> Adopting a full backup solution seems a bit extreme.  On the other hand,
> having some real core implementation of an archive_command for the most 
> general
> use cases (local copy, distant copy over ssh...) could make sense.  This would
> remove that burden for some, probably most, of the 3rd party backup tools, and
> would also ensure that the various requirements are properly documented since
> it would be the implementation reference.

Having a database platform that hasn't got a full backup solution is a
pretty awkward position to be in.

I'd like to see something a bit more specific than handwaving about how
core could provide something in this area which would remove the burden
from other tools.  Would also be good to know who is going to write that
and maintain it.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 11:01 AM Tom Lane  wrote:
> The goal I have in mind is for snapshot_too_old to be fixed or gone
> in v15.  I don't feel a need to force the issue sooner than that, so
> there's plenty of time for someone to step up, if anyone wishes to.

Seems more than reasonable to me. A year ought to be plenty of time if
the feature truly is salvageable.

What do other people think? Ideally we could commit to that hard
deadline now. To me the important thing is to actually have a real
deadline that forces the issue one way or another. This situation must
not be allowed to drag on forever.

-- 
Peter Geoghegan




Re: a path towards replacing GEQO with something better

2021-06-15 Thread John Naylor
On Tue, Jun 15, 2021 at 12:15 PM Robert Haas  wrote:
>
> On Wed, Jun 9, 2021 at 9:24 PM John Naylor 
wrote:
> > 3) It actually improves the existing exhaustive search, because the
complexity of the join order problem depends on the query shape: a "chain"
shape (linear) vs. a "star" shape (as in star schema), for the most common
examples. The size of the DP table grows like this (for n >= 4):
...
> I don't quite understand the difference between the "chain" case and
> the "star" case. Can you show sample queries for each one? e.g. SELECT
> ... FROM a_1, a_2, ..., a_n WHERE ?

There's a very simple example in the optimizer README:

--
SELECT  *
FROMtab1, tab2, tab3, tab4
WHERE   tab1.col = tab2.col AND
tab2.col = tab3.col AND
tab3.col = tab4.col

Tables 1, 2, 3, and 4 are joined as:
{1 2},{2 3},{3 4}
{1 2 3},{2 3 4}
{1 2 3 4}
(other possibilities will be excluded for lack of join clauses)

SELECT  *
FROMtab1, tab2, tab3, tab4
WHERE   tab1.col = tab2.col AND
tab1.col = tab3.col AND
tab1.col = tab4.col

Tables 1, 2, 3, and 4 are joined as:
{1 2},{1 3},{1 4}
{1 2 3},{1 3 4},{1 2 4}
{1 2 3 4}
--

The first one is chain, and the second is star. Four is the smallest set
where we have a difference. I should now point out an imprecision in my
language: By "size of the DP table", the numbers in my first email refer to
the number of joinrels times the number of possible joins (not paths, and
ignoring commutativity). Here are the steps laid out, with cumulative
counts:

join_level, # joins,  cumulative # joins:

linear, n=4
 2   3   3
 3   4   7
 4   3  10

star, n=4
 2   3   3
 3   6   9
 4   3  12

And of course, the chain query also has three at the last level, because it
tries two left- (or right-) deep joins and one bushy join.
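
(In closed form, with the same counting convention, chain works out to
(n^3 - n) / 6 joins in total and star to (n - 1) * 2^(n - 2); for n = 4
those give the 10 and 12 above, and the gap between the two grows quickly
with n.)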

> One idea I just ran across in
>
https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf
> is to try to economize by skipping consideration of bushy plans.

That's a good paper, and it did influence my thinking.

You likely already know this, but for the archives: If only chain queries
could have bushy plans, it wouldn't matter because they are so cheap to
enumerate. But, since star queries can introduce a large number of extra
joins via equivalence (same join column or FK), making them resemble
"clique" queries, bushy joins get excessively numerous.

> We
> could start doing that when some budget is exceeded, similar to what
> you are proposing here, but probably the budget for skipping
> consideration of bushy plans would be smaller than the budget for
> switching to IDP. The idea of skipping bushy plan generation in some
> cases makes sense to me intuitively because most of the plans
> PostgreSQL generates are mostly left-deep, and even when we do
> generate bushy plans, they're not always a whole lot better than the
> nearest equivalent left-deep plan. The paper argues that considering
> bushy plans makes measurable gains, but also that failure to consider
> such plans isn't catastrophically bad.

I think that makes sense. There are a few things we could do within the
"grey zone" -- too many rels to finish exhaustive search, but not enough to
justify starting directly with the greedy step -- to increase our chances
of completing, and that's a very simple one.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: disfavoring unparameterized nested loops

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 10:09 AM Robert Haas  wrote:
> How to do that is not very clear. One very simple thing we could do
> would be to introduce enable_nestloop=parameterized or
> enable_parameterized_nestloop=false. That is a pretty blunt instrument
> but the authors of the paper seem to have done that with positive
> results, so maybe it's actually not a dumb idea.

I think that it's probably a good idea as-is.

Simple heuristics that are very frequently wrong when considered in a
naive way can work very well in practice. This seems to happen when
they capture some kind of extreme naturally occuring cost/benefit
asymmetry -- especially one with fixed well understood costs and
unlimited benefits (this business with unparameterized nestloop joins
is about *avoiding* the inverse asymmetry, but that seems very
similar). My go-to example of such an asymmetry is the rightmost page
split heuristic of applying leaf fillfactor regardless of any of the
other specifics; we effectively assume that all indexes are on columns
with ever-increasing values. Which is obviously wrong.

We're choosing between two alternatives (unparameterized nested loop
vs hash join) that are really very similar when things go as expected,
but diverge sharply when there is a misestimation -- who wouldn't take
the "conservative" choice here?

I guess that there is a hesitation to not introduce heuristics like
this because it doesn't fit into some larger framework that captures
risk -- it might be seen as an ugly special case. But isn't this
already actually kind of special, whether or not we officially think
so?

-- 
Peter Geoghegan




Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Tom Lane
Peter Geoghegan  writes:
> On Tue, Jun 15, 2021 at 9:51 AM Tom Lane  wrote:
>> So, it's well over a year later, and so far as I can see exactly
>> nothing has been done about snapshot_too_old's problems.

> I propose that the revert question be explicitly timeboxed. If the
> issues haven't been fixed by some date, then "snapshot too old"
> automatically gets reverted without further discussion. This gives
> qualified hackers the opportunity to save the feature if they feel
> strongly about it, and are actually willing to take responsibility for
> its ongoing maintenance.

The goal I have in mind is for snapshot_too_old to be fixed or gone
in v15.  I don't feel a need to force the issue sooner than that, so
there's plenty of time for someone to step up, if anyone wishes to.

I imagine that we should just ignore the question of whether anything
can be done for it in the back branches.  Given the problems
identified upthread, fixing it in a non-back-patchable way would be
challenging enough.

regards, tom lane




Re: unnesting multirange data types

2021-06-15 Thread Jonathan S. Katz
On 6/15/21 1:52 PM, Tom Lane wrote:
> Alvaro Herrera  writes:
>> On 2021-Jun-15, Tom Lane wrote:
>>> I think this ought to be reverted and reviewed more carefully.
> 
>> It seems to me that removing the cast-to-range[] is a sufficient fix,
>> and that we can do with only the unnest part for pg14; the casts can be
>> added in 15 (if at all).  That would mean reverting only half the patch.
> 
> Might be a reasonable solution.  But right now I'm annoyed that the
> buildfarm is broken, and I'm also convinced that this didn't get
> adequate testing.

I had focused testing primarily on the "unnest" cases that I had
described in my original note. I did a couple of casts and had no issue;
I did not test with pg_dump / pg_upgrade, but I am making a note to do
so in the future in cases like this.

>  I think "revert and reconsider" is the way
> forward for today.

I don't want the buildfarm broken so I'm fine if this is the best way
forward. If we can keep the "unnest" functionality I would strongly
suggest it as that was the premise of the original note to complete the
utility of multiranges for v14. The casting, while convenient, is not
needed.

Jonathan



OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH] Fix buffer not null terminated on (ecpg lib)

2021-06-15 Thread Tom Lane
Andres Freund  writes:
> On 2021-06-15 07:40:46 -0300, Ranier Vilela wrote:
>> memcpy would not suffer from it?

> It'd not be correct for short sqlstates - you'd read beyond the end of
> the source buffer. There are cases of it in the ecpg code.

What's a "short SQLSTATE"?  They're all five characters by definition.

regards, tom lane




Re: unnesting multirange data types

2021-06-15 Thread Tom Lane
Alvaro Herrera  writes:
> On 2021-Jun-15, Tom Lane wrote:
>> I think this ought to be reverted and reviewed more carefully.

> It seems to me that removing the cast-to-range[] is a sufficient fix,
> and that we can do with only the unnest part for pg14; the casts can be
> added in 15 (if at all).  That would mean reverting only half the patch.

Might be a reasonable solution.  But right now I'm annoyed that the
buildfarm is broken, and I'm also convinced that this didn't get
adequate testing.  I think "revert and reconsider" is the way
forward for today.

regards, tom lane




Re: [PATCH] Fix buffer not null terminated on (ecpg lib)

2021-06-15 Thread Andres Freund
Hi,

On 2021-06-15 07:40:46 -0300, Ranier Vilela wrote:
> memcpy would not suffer from it?

It'd not be correct for short sqlstates - you'd read beyond the end of
the source buffer. There are cases of it in the ecpg code.

Greetings,

Andres Freund




Re: unnesting multirange data types

2021-06-15 Thread Alvaro Herrera
On 2021-Jun-15, Tom Lane wrote:

> It looks to me like the proximate problem is that you should
> have taught pg_dump to skip these new auto-generated functions.
> However, I fail to see why we need auto-generated functions
> for this at all.  Couldn't we have done it with one polymorphic
> function?

I think such a function would need to take anycompatiblerangearray,
which is not something we currently have.

> I think this ought to be reverted and reviewed more carefully.

It seems to me that removing the cast-to-range[] is a sufficient fix,
and that we can do with only the unnest part for pg14; the casts can be
added in 15 (if at all).  That would mean reverting only half the patch.

-- 
Álvaro Herrera39°49'30"S 73°17'W




Re: unnesting multirange data types

2021-06-15 Thread Tom Lane
Alexander Korotkov  writes:
> I did run "check-world", but it passed for me.  Probably the same
> reason it passed for some buildfarm animals...

The only buildfarm animals that have passed since this went in
are the ones that don't run the pg_dump or pg_upgrade tests.

It looks to me like the proximate problem is that you should
have taught pg_dump to skip these new auto-generated functions.
However, I fail to see why we need auto-generated functions
for this at all.  Couldn't we have done it with one polymorphic
function?

I think this ought to be reverted and reviewed more carefully.

regards, tom lane




Re: a path towards replacing GEQO with something better

2021-06-15 Thread Robert Haas
On Tue, Jun 15, 2021 at 1:00 PM Tom Lane  wrote:
> Still, I take your point that maybe we could ratchet down the cost of
> exhaustive search by skimping on this part.  Maybe it'd work to skip
> bushy so long as we'd found at least one left-deep or right-deep path
> for the current rel.

Yes, that sounds better.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Peter Geoghegan
On Tue, Jun 15, 2021 at 9:51 AM Tom Lane  wrote:
> So, it's well over a year later, and so far as I can see exactly
> nothing has been done about snapshot_too_old's problems.

FWIW I think that the concept itself is basically reasonable. The
implementation is very flawed, though, so it hardly enters into it.

> I never liked that feature to begin with, and I would be very
> glad to undertake the task of ripping it out.  If someone thinks
> this should not happen, please commit to fixing it ... and not
> "eventually".

ISTM that this is currently everybody's responsibility, and therefore
nobody's responsibility. That's probably why the problems haven't been
resolved yet.

I propose that the revert question be explicitly timeboxed. If the
issues haven't been fixed by some date, then "snapshot too old"
automatically gets reverted without further discussion. This gives
qualified hackers the opportunity to save the feature if they feel
strongly about it, and are actually willing to take responsibility for
its ongoing maintenance.

-- 
Peter Geoghegan




Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Andres Freund
Hi,

On 2021-06-15 12:51:28 -0400, Tom Lane wrote:
> Robert Haas  writes:
> > Oh, maybe I'm the one who misunderstood...
> 
> So, it's well over a year later, and so far as I can see exactly
> nothing has been done about snapshot_too_old's problems.
> 
> I never liked that feature to begin with, and I would be very
> glad to undertake the task of ripping it out.  If someone thinks
> this should not happen, please commit to fixing it ... and not
> "eventually".

I still think that's the most reasonable course. I actually like the
feature, but I don't think a better implementation of it would share
much if any of the current infrastructure.

Greetings,

Andres Freund




disfavoring unparameterized nested loops

2021-06-15 Thread Robert Haas
From time to time, someone tells me that they've configured
enable_nestloop=false in postgresql.conf, which is a pretty bad idea
since there are a significant number of cases where such plans are the
only reasonable way of executing some query. However, it's no great
secret that PostgreSQL's optimizer sometimes produces nested loops
that are very, very, very slow, generally because it has grossly
underestimated the cardinality of the inner side of the nested loop.
The frustration which users experience as a result of such events is
understandable.

I read 
https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf
today and found out that the authors of that paper did something a bit
more nuanced which, in their experiments, was very successful. It
sounds like what they basically did is disabled unparameterized nested
loops. They argue that such nested loops figure to gain very little as
compared with a hash join, but that they might turn out to lose a lot
if the cardinality estimation is inaccurate, and they present
experimental results to back up those claims. One observation that the
paper makes along the way is that every system they tested is more
likely to underestimate the cardinality of joins than to overestimate
it, and that this tendency becomes more pronounced as the size of the
join planning problem increases. On reflection, I believe this matches
my experience, and it makes sense that it should be so, since it
occasionally happens that the join selectivity estimate is essentially
zero, and a bigger join problem is more likely to have at least one
such case. On the other hand, the join selectivity estimate can never
be +infinity. Hence, it's more important in general for a database
system to be resilient against underestimates than to be resilient
against overestimates. Being less willing to choose unparameterized
nested loops is one way to move in that direction.

How to do that is not very clear. One very simple thing we could do
would be to introduce enable_nestloop=parameterized or
enable_parameterized_nestloop=false. That is a pretty blunt instrument
but the authors of the paper seem to have done that with positive
results, so maybe it's actually not a dumb idea. Another approach
would be to introduce a large fuzz factor for such nested loops e.g.
keep them only if the cost estimate is better than the comparable hash
join plan by at least a factor of N (or if no such plan is being
generated for some reason). I'm not very confident that this would
actually do what we want, though. In the problematic cases, a nested
loop is likely to look extremely cheap, so just imagining that the
cost might be higher is not very protective. Perhaps a better approach
would be something like: if the estimated number of inner rows is less
than 100, then re-estimate the cost of this approach and of the best
available hash join on the assumption that there are 100 inner rows.
If the hash join still wins, keep it; if it loses under that
assumption, throw it out. I think it's likely that this approach would
eliminate a large number of highly risky nested loop plans, probably
even with s/100/10/g, without causing many other problems (except
perhaps increased planner CPU consumption ... but maybe that's not too
bad).
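
As a toy illustration of that heuristic (invented names and deliberately
simplistic cost formulas, not actual planner code):

#include <math.h>
#include <stdbool.h>

/*
 * Keep an unparameterized nested loop only if it still beats the hash
 * join after assuming the inner side has at least min_assumed_rows rows.
 */
static bool
keep_unparam_nestloop(double outer_rows, double est_inner_rows,
                      double min_assumed_rows,
                      double nl_cost_per_pair,  /* cost per (outer, inner) pair */
                      double hj_cost_per_tuple) /* cost per hashed/probed tuple */
{
    double      inner_rows = fmax(est_inner_rows, min_assumed_rows);
    double      nestloop_cost = outer_rows * inner_rows * nl_cost_per_pair;
    double      hashjoin_cost = (outer_rows + inner_rows) * hj_cost_per_tuple;

    return nestloop_cost < hashjoin_cost;
}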

Just to be clear, I do understand that there are cases where no Hash
Join is possible, but anything we do here could be made to apply only
when a hash join is in fact possible. We could also think about making
the point of comparison the best other plans of any sort rather than a
hash join specifically, which feels a little more principled but might
actually be worse. When a Nested Loop is a stupid idea, it's stupid
precisely because the inner side is big and we could've avoided
recomputing it over and over by using a Hash Join instead, not because
some Merge Join based plan turns out to be better. I mean, it is
possible that some Merge Join plan does turn out to be better, but
that's not rage-inducing in the same way. Nobody looks at a
complicated join plan that happened to use a Nested Loop and says
"obviously, this is inferior to a merge join," or if they do, they're
probably full of hot air. But people look at complicated join plans
that happen to use a Nested Loop and say "obviously, this is inferior
to a hash join" *all the time* and assuming the inner path is
unparameterized, they are very often correct.

Thoughts? I'd be particularly curious to hear about any cases anyone
knows about where an unparameterized nested loop and a hash join with
batches=1 are both possible and where the unparameterized nested loop
is *much* cheaper than the hash join.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: change logging defaults

2021-06-15 Thread Tom Lane
Justin Pryzby  writes:
> I propose to change some defaults:
> log_autovacuum_min_duration = 0
> log_checkpoints = on
> log_lock_waits = on (and log_recovery_conflict_waits too?)
> log_temp_files = 128kB

Why?

Based on reports that I see, some quite large percentage of Postgres
DBAs never look at the postmaster log at all.  So making the log
bulkier isn't something that will be useful to them.  People who do
want these reports are certainly capable of turning them on.

> Note that pg_regress does this:

What we find useful for testing seems to me to be nearly
unrelated to production needs.

regards, tom lane




Re: a path towards replacing GEQO with something better

2021-06-15 Thread Tom Lane
Robert Haas  writes:
> One idea I just ran across in
> https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf
> is to try to economize by skipping consideration of bushy plans. We
> could start doing that when some budget is exceeded, similar to what
> you are proposing here, but probably the budget for skipping
> consideration of bushy plans would be smaller than the budget for
> switching to IDP. The idea of skipping bushy plan generation in some
> cases makes sense to me intuitively because most of the plans
> PostgreSQL generates are mostly left-deep, and even when we do
> generate bushy plans, they're not always a whole lot better than the
> nearest equivalent left-deep plan. The paper argues that considering
> bushy plans makes measurable gains, but also that failure to consider
> such plans isn't catastrophically bad.

It's not catastrophically bad until you hit a query where the only
correct plans are bushy.  These do exist, and I think they're not
that uncommon.

Still, I take your point that maybe we could ratchet down the cost of
exhaustive search by skimping on this part.  Maybe it'd work to skip
bushy so long as we'd found at least one left-deep or right-deep path
for the current rel.

regards, tom lane




Re: snapshot too old issues, first around wraparound and then more.

2021-06-15 Thread Tom Lane
Robert Haas  writes:
> Oh, maybe I'm the one who misunderstood...

So, it's well over a year later, and so far as I can see exactly
nothing has been done about snapshot_too_old's problems.

I never liked that feature to begin with, and I would be very
glad to undertake the task of ripping it out.  If someone thinks
this should not happen, please commit to fixing it ... and not
"eventually".

regards, tom lane




Re: Case expression pushdown

2021-06-15 Thread Alexander Pyhalov

Hi.

Ashutosh Bapat wrote on 2021-06-15 16:24:
> Looks quite useful to me. Can you please add this to the next
> commitfest?




Added to the commitfest. Here is an updated patch version.

--
Best regards,
Alexander Pyhalov,
Postgres Professional
From 80d60eb9b1630ee55d1825964e0e976ae6c289a1 Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov 
Date: Tue, 30 Mar 2021 13:24:14 +0300
Subject: [PATCH] Allow pushing CASE expression to foreign server

---
 contrib/postgres_fdw/deparse.c| 118 ++
 .../postgres_fdw/expected/postgres_fdw.out|  47 +++
 contrib/postgres_fdw/sql/postgres_fdw.sql |  22 
 3 files changed, 187 insertions(+)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 31919fda8c6..3621fed4b54 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -87,6 +87,7 @@ typedef struct foreign_loc_cxt
 {
 	Oid			collation;		/* OID of current collation, if any */
 	FDWCollateState state;		/* state of current collation choice */
+	Expr	   *case_arg;		/* the last case arg to inspect */
 } foreign_loc_cxt;
 
 /*
@@ -101,6 +102,7 @@ typedef struct deparse_expr_cxt
  * a base relation. */
 	StringInfo	buf;			/* output buffer to append to */
 	List	  **params_list;	/* exprs that will become remote Params */
+	List	   *case_args;		/* list of args to deparse CaseTestExpr */
 } deparse_expr_cxt;
 
 #define REL_ALIAS_PREFIX	"r"
@@ -186,6 +188,9 @@ static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 	deparse_expr_cxt *context);
 
+static void deparseCaseExpr(CaseExpr *node, deparse_expr_cxt *context);
+static void deparseCaseTestExpr(CaseTestExpr *node, deparse_expr_cxt *context);
+
 /*
  * Helper functions
  */
@@ -254,6 +259,7 @@ is_foreign_expr(PlannerInfo *root,
 		glob_cxt.relids = baserel->relids;
 	loc_cxt.collation = InvalidOid;
 	loc_cxt.state = FDW_COLLATE_NONE;
+	loc_cxt.case_arg = NULL;
 	if (!foreign_expr_walker((Node *) expr, &glob_cxt, &loc_cxt))
 		return false;
 
@@ -312,6 +318,7 @@ foreign_expr_walker(Node *node,
 	/* Set up inner_cxt for possible recursion to child nodes */
 	inner_cxt.collation = InvalidOid;
 	inner_cxt.state = FDW_COLLATE_NONE;
+	inner_cxt.case_arg = outer_cxt->case_arg;
 
 	switch (nodeTag(node))
 	{
@@ -509,6 +516,62 @@ foreign_expr_walker(Node *node,
 	state = FDW_COLLATE_UNSAFE;
 			}
 			break;
+		case T_CaseExpr:
+			{
+CaseExpr   *ce = (CaseExpr *) node;
+ListCell   *arg;
+
+if (ce->arg)
+	inner_cxt.case_arg = ce->arg;
+
+foreach(arg, ce->args)
+{
+	CaseWhen   *w = lfirst_node(CaseWhen, arg);
+
+	if (!foreign_expr_walker((Node *) w->expr,
+			 glob_cxt, &inner_cxt))
+		return false;
+
+	if (!foreign_expr_walker((Node *) w->result,
+			 glob_cxt, &inner_cxt))
+		return false;
+}
+
+if (!foreign_expr_walker((Node *) ce->defresult,
+		 glob_cxt, &inner_cxt))
+	return false;
+
+collation = ce->casecollid;
+if (collation == InvalidOid)
+	state = FDW_COLLATE_NONE;
+else if (inner_cxt.state == FDW_COLLATE_SAFE &&
+		 collation == inner_cxt.collation)
+	state = FDW_COLLATE_SAFE;
+else if (collation == DEFAULT_COLLATION_OID)
+	state = FDW_COLLATE_NONE;
+else
+	state = FDW_COLLATE_UNSAFE;
+			}
+			break;
+		case T_CaseTestExpr:
+			{
+Expr	   *arg;
+
+Assert(outer_cxt->case_arg != NULL);
+arg = outer_cxt->case_arg;
+
+if (!foreign_expr_walker((Node *) arg,
+		 glob_cxt, &inner_cxt))
+	return false;
+
+/*
+ * Collation and state just bubble up from the previously
+ * saved case argument
+ */
+collation = inner_cxt.collation;
+state = inner_cxt.state;
+			}
+			break;
 		case T_OpExpr:
 		case T_DistinctExpr:	/* struct-equivalent to OpExpr */
 			{
@@ -1019,6 +1082,7 @@ deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root, RelOptInfo *rel,
 	context.foreignrel = rel;
 	context.scanrel = IS_UPPER_REL(rel) ? fpinfo->outerrel : rel;
 	context.params_list = params_list;
+	context.case_args = NIL;
 
 	/* Construct SELECT clause */
 	deparseSelectSql(tlist, is_subquery, retrieved_attrs, &context);
@@ -1598,6 +1662,7 @@ deparseFromExprForRel(StringInfo buf, PlannerInfo *root, RelOptInfo *foreignrel,
 			context.scanrel = foreignrel;
 			context.root = root;
 			context.params_list = params_list;
+			context.case_args = NIL;
 
 			appendStringInfoChar(buf, '(');
 	appendConditions(fpinfo->joinclauses, &context);
@@ -1901,6 +1966,7 @@ deparseDirectUpdateSql(StringInfo buf, PlannerInfo *root,
 	context.scanrel = foreignrel;
 	context.buf = buf;
 	context.params_list = params_list;
+	context.case_args = NIL;
 
 	appendStringInfoString(buf, "UPDATE ");
 	deparseRelation(buf, rel);
@@ -2008,6 +2074,7 @@ deparseDirectDeleteSql(StringInfo buf, PlannerInfo *root,
 	context.scanrel = foreignrel;
 	context.buf = buf;
 	context.params_list = params_list;
+	

Re: Duplicate history file?

2021-06-15 Thread Julien Rouhaud
On Tue, Jun 15, 2021 at 11:33:10AM -0400, Stephen Frost wrote:
> 
> The requirements are things which are learned over years and changes
> over time.  Trying to document them and keep up with them would be a
> pretty serious project all on its own.  There are external projects who
> spend serious time and energy doing their best to provide the tooling
> needed here and we should be promoting those, not trying to pretend like
> this is a simple thing which anyone could write a short perl script to
> accomplish.

The fact that this is such a complex problem is the very reason why we should
spend a lot of energy documenting the various requirements.  Otherwise, how
could anyone implement a valid program for that and how could anyone validate
that a solution claiming to do its job actually does its job?

> Already tried doing it in perl.  No, it's not simple and it's also
> entirely vaporware today and implies that we're going to develop this
> tool, improve it in the future as we realize it needs to be improved,
> and maintain it as part of core forever.  If we want to actually adopt
> and pull in a backup tool to be part of core then we should talk about
> things which actually exist, such as the various existing projects that
> have been written to specifically work to address all the requirements
> which are understood today, not say "well, we can just write a simple
> perl script to do it" because it's not actually that simple.

Adopting a full backup solution seems a bit extreme.  On the other hand,
having a real core implementation of an archive_command for the most general
use cases (local copy, remote copy over ssh...) could make sense.  This would
remove that burden for some, probably most, of the 3rd party backup tools, and
would also ensure that the various requirements are properly documented, since
it would be the reference implementation.




change logging defaults

2021-06-15 Thread Justin Pryzby
I propose to change some defaults:

log_autovacuum_min_duration = 0
log_checkpoints = on
log_lock_waits = on (and log_recovery_conflict_waits too?)
log_temp_files = 128kB

Note that pg_regress does this:
| fputs("\n# Configuration added by pg_regress\n\n", pg_conf);
| fputs("log_autovacuum_min_duration = 0\n", pg_conf);
| fputs("log_checkpoints = on\n", pg_conf);
| fputs("log_line_prefix = '%m %b[%p] %q%a '\n", pg_conf);
| fputs("log_lock_waits = on\n", pg_conf);
| fputs("log_temp_files = 128kB\n", pg_conf);
| fputs("max_prepared_transactions = 2\n", pg_conf);

-- 
Justin




Re: a path towards replacing GEQO with something better

2021-06-15 Thread Robert Haas
On Wed, Jun 9, 2021 at 9:24 PM John Naylor  wrote:
> 3) It actually improves the existing exhaustive search, because the 
> complexity of the join order problem depends on the query shape: a "chain" 
> shape (linear) vs. a "star" shape (as in star schema), for the most common 
> examples. The size of the DP table grows like this (for n >= 4):
>
> Chain: (n^3 - n) / 6   (including bushy plans)
> Star:  (n - 1) * 2^(n - 2)
>
>  n  chain   star
> 
>  4 10 12
>  5 20 32
>  6 35 80
>  7 56192
>  8 84448
>  9120   1024
> 10165   2304
> 11220   5120
> 12286  11264
> 13364  24576
> 14455  53248
> 15560 114688
> ...
> 64  43680 290536219160925437952

I don't quite understand the difference between the "chain" case and
the "star" case. Can you show sample queries for each one? e.g. SELECT
... FROM a_1, a_2, ..., a_n WHERE ?

One idea I just ran across in
https://15721.courses.cs.cmu.edu/spring2020/papers/22-costmodels/p204-leis.pdf
is to try to economize by skipping consideration of bushy plans. We
could start doing that when some budget is exceeded, similar to what
you are proposing here, but probably the budget for skipping
consideration of bushy plans would be smaller than the budget for
switching to IDP. The idea of skipping bushy plan generation in some
cases makes sense to me intuitively because most of the plans
PostgreSQL generates are mostly left-deep, and even when we do
generate bushy plans, they're not always a whole lot better than the
nearest equivalent left-deep plan. The paper argues that considering
bushy plans makes measurable gains, but also that failure to consider
such plans isn't catastrophically bad.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Different compression methods for FPI

2021-06-15 Thread Justin Pryzby
On Tue, Jun 15, 2021 at 07:53:26AM +0200, Peter Eisentraut wrote:
> On 15.06.21 07:37, Michael Paquier wrote:
> > > > Actually, I was just thinking that default yes/no/on/off stuff maybe 
> > > > should be
> > > > defined to mean "lz4" rather than meaning pglz for "backwards 
> > > > compatibility".
> > > +1
> > I am not sure that we have any reasons to be that aggressive about
> > this one either, and this would mean that wal_compression=on implies a
> > different method depending on the build options.  I would just stick
> > with the past, careful practice that we have to use a default
> > backward-compatible value as default, while being able to use a new
> > option.

You're right, I hadn't thought this through all the way.
There's precedent if the default is non-static (wal_sync_method).

But I think yes/on/true/1 should be a compatibility alias for a static thing,
and then the only option is pglz.

Of course, changing the default to LZ4 is still a possibility.
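
For illustration, the alias approach could look something like this in the
GUC enum table (the enum and its value names are assumptions here, not the
actual patch; config_enum_entry is the struct from utils/guc.h):

typedef enum WalCompression
{
    WAL_COMPRESSION_NONE = 0,
    WAL_COMPRESSION_PGLZ,
    WAL_COMPRESSION_LZ4
} WalCompression;

static const struct config_enum_entry wal_compression_options[] = {
    {"pglz", WAL_COMPRESSION_PGLZ, false},
#ifdef USE_LZ4
    {"lz4", WAL_COMPRESSION_LZ4, false},
#endif
    {"off", WAL_COMPRESSION_NONE, false},
    /* compatibility aliases: plain "on" keeps meaning pglz */
    {"on", WAL_COMPRESSION_PGLZ, false},
    {"true", WAL_COMPRESSION_PGLZ, true},
    {"false", WAL_COMPRESSION_NONE, true},
    {NULL, 0, false}
};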

> If we think this new thing is strictly better than the old thing, then why
> not make it the default.  What would be the gain from sticking to the old
> default?
> 
> The point that the default would depend on build options is a valid one.
> I'm not sure whether it's important enough by itself.




Re: unnesting multirange data types

2021-06-15 Thread Alexander Korotkov
On Tue, Jun 15, 2021 at 4:49 PM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > Pushed!  Thanks to thread participants for raising this topic and
> > review.  I'll be around to resolve issues if any.
>
> Buildfarm is pretty thoroughly unhappy.  Did you do a "check-world"
> before pushing?

Yes, I'm looking at this now.

I did run "check-world", but it passed for me.  Probably the same
reason it passed for some buildfarm animals...

--
Regards,
Alexander Korotkov




Should wal receiver reply to wal sender more aggressively?

2021-06-15 Thread Paul Guo
While working on some related issues I found that the wal receiver
runs the walrcv_receive() loop before replying with the
write/flush/apply LSNs to the wal sender in XLogWalRcvSendReply(). It is
possible that the walrcv_receive() loop receives and writes a lot of
xlog, so it does not reply with that LSN information in time, which
finally slows down transactions due to the syncrep wait (assuming the
default synchronous_commit).

In my TPCB testing, I found the worst case was that 10,466,469 bytes
were consumed in the walrcv_receive() loop.

More seriously, we call XLogWalRcvSendReply(false, false) only after
handling those bytes. The first argument false means no force, i.e. it
notifies the sender only if the GUC wal_receiver_status_interval
(10s by default) has elapsed, so we may have to wait for later calls of
XLogWalRcvSendReply() to notify the wal sender.

I thought about and tried enhancing this by force-replying to the wal
sender whenever a maximum number of bytes (e.g. 128K) has been received,
but several things confused me:

- What's the purpose of the GUC wal_receiver_status_interval? The OS
  kernel is usually not efficient when handling small packets, but we
  are not replying that aggressively, so why is this GUC there?

- I ran a simple TPCB workload (scale factor 1000, 200 connections,
  shared_buffers and max_connections tuned) but found no obvious
  performance difference with and without the code change. I did not
  see an obvious system (IO/CPU/network) bottleneck - probably the
  bottleneck is in PG itself. I have not investigated further at this
  moment, but the change should in theory help, no?

Another thing that came to my mind is the wal receiver logic:
currently the wal receiver process does network IO, WAL write, and WAL
flush all in one process. Network IO is async, blocking at epoll/poll,
and WAL write is mostly non-blocking, but WAL flush could probably be
decoupled into a dedicated process. And maybe use sync_file_range
instead of a WAL file fsync in issue_xlog_fsync()? We should sync the
WAL contents with lower LSNs first and reply to the wal sender in time,
right?

Below is the related code:

		/* See if we can read data immediately */
		len = walrcv_receive(wrconn, &buf, &wait_fd);
		if (len != 0)
		{
			/*
			 * Process the received data, and any subsequent data we
			 * can read without blocking.
			 */
			for (;;)
			{
				if (len > 0)
				{
					/*
					 * Something was received from primary, so reset
					 * timeout
					 */
					last_recv_timestamp = GetCurrentTimestamp();
					ping_sent = false;
					XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);
				}
				else if (len == 0)
					break;
				else if (len < 0)
				{
					ereport(LOG,
							(errmsg("replication terminated by primary server"),
							 errdetail("End of WAL reached on timeline %u at %X/%X.",
									   startpointTLI,
									   LSN_FORMAT_ARGS(LogstreamResult.Write))));
					endofwal = true;
					break;
				}
				len = walrcv_receive(wrconn, &buf, &wait_fd);
			}

			/* Let the primary know that we received some data. */
			XLogWalRcvSendReply(false, false);

			/*
			 * If we've written some records, flush them to disk and
			 * let the startup process and primary server know about
			 * them.
			 */
			XLogWalRcvFlush(false);
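
A rough sketch of the force-reply change (illustration only, not the exact
patch I tested; the 128K threshold constant and the byte counter are
assumptions):

		uint64		bytes_since_reply = 0;

		for (;;)
		{
			if (len > 0)
			{
				last_recv_timestamp = GetCurrentTimestamp();
				ping_sent = false;
				XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);

				bytes_since_reply += len;
				if (bytes_since_reply >= 128 * 1024)
				{
					/* force an immediate reply, bypassing the interval check */
					XLogWalRcvSendReply(true, false);
					bytes_since_reply = 0;
				}
			}
			else if (len == 0)
				break;
			else if (len < 0)
				break;			/* error/EOF handling as in the existing loop */

			len = walrcv_receive(wrconn, &buf, &wait_fd);
		}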

-- 
Paul Guo (Vmware)




Re: Delegating superuser tasks to new security roles

2021-06-15 Thread Stephen Frost
Greetings,

* torikoshia (torikos...@oss.nttdata.com) wrote:
> On 2021-06-14 23:53, Mark Dilger wrote:
> >>On Jun 14, 2021, at 5:51 AM, torikoshia 
> >>wrote:
> >>BTW, do these patches enable non-superusers to create user with
> >>bypassrls?
[...]
> >Do you believe that functionality should be added?  I have not thought
> >much about that issue.
> 
> I just noticed that because I was looking into operations that can only be
> done by superusers.

In general, I agree with the sentiment that we should be providing a way
to have non-superusers able to do things that only a superuser can do
today.  I'd love to get rid of all of the explicit superuser checks in
the backend except the one that makes a superuser a member of all roles.

Thanks,

Stephen




Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-06-15 Thread David Rowley
On Wed, 16 Jun 2021 at 02:58, David Christensen
 wrote:
> That said, it seems like having the code structured in a way that we can 
> expand via adding an element to a table instead of the existing way it's 
> written with nested if blocks is still a useful refactor, whatever we decide 
> the cutoff units should be.

I had not really looked at the patch, but if there's a cleanup portion
in the same patch where you're adding the YB unit, then maybe it's worth
separating those out into another patch so that the two can be
considered independently.

David




Re: Duplicate history file?

2021-06-15 Thread Stephen Frost
Greetings,

* Kyotaro Horiguchi (horikyota@gmail.com) wrote:
> At Fri, 11 Jun 2021 16:08:33 +0900, Michael Paquier  
> wrote in 
> > On Fri, Jun 11, 2021 at 03:32:28PM +0900, Kyotaro Horiguchi wrote:
> > > I think cp can be an example as far as we explain the limitations. (On
> > > the other hand "test !-f" cannot since it actually prevents server
> > > from working correctly.)
> > 
> > Disagreed.  I think that we should not try to change this area until
> > we can document a reliable solution, and a simple "cp" is not that.
> 
> Isn't removing cp from the documentation a change in this area? I
> basically agree not to change anything, but the current example
> "test ! -f  && cp .." and the relevant description have been known to
> be problematic in certain situations.

[...]

> - Write the full (known) requirements and use a pseudo tool-name in
>  the example?

I'm generally in favor of just using a pseudo tool-name and then perhaps
providing a link to a new place on .Org where people can ask to have
their PG backup solution listed, or something along those lines.

>  - provide a minimal implementation of the command?

Having been down this road for a rather long time, I can't accept this
as a serious suggestion.  No, not even with Perl.  Been there, done
that, not going back.

>  - recommend some external tools (that we can guarantee that they
>conform to the requirements)?

The requirements are things which are learned over years and change
over time.  Trying to document them and keep up with them would be a
pretty serious project all on its own.  There are external projects who
spend serious time and energy doing their best to provide the tooling
needed here and we should be promoting those, not trying to pretend like
this is a simple thing which anyone could write a short perl script to
accomplish.

>  - not recommend any tools?

This is the approach that has been tried and it's, objectively, failed
miserably.  Our users are ending up with invalid and unusable backups,
corrupted WAL segments, inability to use PITR, and various other issues
because we've been trying to pretend that this isn't a hard problem.  We
really need to stop that and accept that it's hard and promote the tools
which have been explicitly written to address that hard problem.

> > Hmm.  A simple command that could be used as reference is for example
> > "dd" that flushes the file by itself, or we could just revisit the
> > discussions about having a pg_copy command, or we could document a
> > small utility in perl that does the job.
> 
> I think we should do that if pg_copy conforms to the mandatory
> requirements, but maybe that's for the future. Showing a minimal
> implementation in perl looks good.

Already tried doing it in perl.  No, it's not simple and it's also
entirely vaporware today and implies that we're going to develop this
tool, improve it in the future as we realize it needs to be improved,
and maintain it as part of core forever.  If we want to actually adopt
and pull in a backup tool to be part of core then we should talk about
things which actually exist, such as the various existing projects that
have been written to specifically work to address all the requirements
which are understood today, not say "well, we can just write a simple
perl script to do it" because it's not actually that simple.

Providing yet another half solution would be doubling-down on the failed
approach to document a "simple" solution and would be a disservice to
our users.

Thanks,

Stephen




Re: Use singular number when appropriate

2021-06-15 Thread David Fetter
On Tue, Jun 15, 2021 at 09:37:11AM +0200, Laurenz Albe wrote:
> On Tue, 2021-06-15 at 04:59 +, David Fetter wrote:
> > I thought about using the dual, but wasn't sure how many languages
> > support it.
> 
> I think none of the languages for which we cater uses the dual.  But
> I guess you were joking, since the tests are not translated ...

I was.

> > if (fail_count == 0 && fail_ignore_count == 0)
> > snprintf(buf, sizeof(buf),
> > -_(" All %d tests passed. "),
> > -success_count);
> > +_(" %s %d test%s passed. "),
> > +success_count == 1 ? "The" : "All",
> > +success_count,
> > +success_count == 1 ? "" : "s");
> 
> ... and that wouldn't be translatable.

Thanks, will rearrange.
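
One translatable arrangement might be to keep each alternative as a whole
sentence, something along these lines (just a sketch):

	if (fail_count == 0 && fail_ignore_count == 0)
	{
		if (success_count == 1)
			snprintf(buf, sizeof(buf),
					 _(" The single test passed. "));
		else
			snprintf(buf, sizeof(buf),
					 _(" All %d tests passed. "),
					 success_count);
	}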

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: pg_stat_statements and "IN" conditions

2021-06-15 Thread Dmitry Dolgov
> On Thu, Mar 18, 2021 at 04:50:02PM +0100, Dmitry Dolgov wrote:
> > On Thu, Mar 18, 2021 at 09:38:09AM -0400, David Steele wrote:
> > On 1/5/21 10:51 AM, Zhihong Yu wrote:
> > >
> > > +   int         lastExprLenght = 0;
> > >
> > > Did you mean to name the variable lastExprLenghth ?
> > >
> > > w.r.t. extracting to helper method, the second and third
> > > if (currentExprIdx == pgss_merge_threshold - 1) blocks are similar.
> > > It is up to you whether to create the helper method.
> > > I am fine with the current formation.
> >
> > Dmitry, thoughts on this review?
>
> Oh, right. lastExprLenghth is obviously a typo, and as we agreed that
> the helper is not strictly necessary I wanted to wait a bit hoping for
> more feedback and eventually to post an accumulated patch. Doesn't make
> sense to post another version only to fix one typo :)

Hi,

I've prepared a new rebased version to deal with the new way of
computing the query id, but as always there is one tricky part. From
what I understand, an external module can now provide a custom
implementation of the query id computation algorithm. It seems natural
to think this machinery could be used instead of the patch in this
thread, i.e. one could create custom logic that enables constant
collapsing as needed, so that the same query with different numbers of
constants in an array will be hashed into the same record.

But there is a limitation in how such queries will be normalized
afterwards: to reduce the level of surprise it's necessary to display
the fact that a certain query in fact had more constants than are shown
in the pgss record. Ideally LocationLen needs to carry some bits of
information about what exactly was skipped, and
generate_normalized_query needs to understand that; neither is reachable
for an external module with custom query id logic (without replicating a
significant part of the existing code). Hence, a new version of the
patch.
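
To sketch the kind of extra information needed (the extra field below is
invented for illustration; the attached patch may structure it differently):

/*
 * queryjumble.h currently records only the location and length of each
 * constant.  To normalize a collapsed series as e.g. "($1, $2, ...)",
 * something also has to mark where a merged run of constants ends.
 */
typedef struct LocationLen
{
	int			location;		/* start offset in query text */
	int			length;			/* length in bytes, or -1 to ignore */
	bool		merged;			/* hypothetical: true if this constant
								 * closes a collapsed series */
} LocationLen;
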
From 81a217485385452c7223a140dd4c98eeb5270945 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Thu, 10 Jun 2021 13:15:35 +0200
Subject: [PATCH v4] Prevent jumbling of every element in ArrayExpr

pg_stat_statements produces multiple entries for queries like

SELECT something FROM table WHERE col IN (1, 2, 3, ...)

depending on the number of parameters, because every element of ArrayExpr
is jumbled. Make Consts (or any expression that could be reduced to a
Const) contribute nothing to the jumble hash if they're part of a series
and at a position beyond the specified threshold. Do the same for
similar queries with VALUES as well.
---
 .../expected/pg_stat_statements.out   | 835 +-
 .../pg_stat_statements/pg_stat_statements.c   |  42 +-
 .../sql/pg_stat_statements.sql| 163 
 src/backend/utils/misc/guc.c  |  13 +
 src/backend/utils/misc/queryjumble.c  | 274 +-
 src/include/utils/queryjumble.h   |  11 +-
 6 files changed, 1323 insertions(+), 15 deletions(-)

diff --git a/contrib/pg_stat_statements/expected/pg_stat_statements.out b/contrib/pg_stat_statements/expected/pg_stat_statements.out
index 40b5109b55..3fc1978066 100644
--- a/contrib/pg_stat_statements/expected/pg_stat_statements.out
+++ b/contrib/pg_stat_statements/expected/pg_stat_statements.out
@@ -205,7 +205,7 @@ SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C";
  INSERT INTO test VALUES(generate_series($1, $2), $3) | 1 |   10
  SELECT * FROM test ORDER BY a| 1 |   12
  SELECT * FROM test WHERE a > $1 ORDER BY a   | 2 |4
- SELECT * FROM test WHERE a IN ($1, $2, $3, $4, $5)   | 1 |8
+ SELECT * FROM test WHERE a IN ($1, $2, $3, $4, ...)  | 1 |8
  SELECT pg_stat_statements_reset()| 1 |1
  SELECT query, calls, rows FROM pg_stat_statements ORDER BY query COLLATE "C" | 0 |0
  UPDATE test SET b = $1 WHERE a = $2  | 6 |6
@@ -1067,4 +1067,837 @@ SELECT COUNT(*) FROM pg_stat_statements WHERE query LIKE '%SELECT GROUPING%';
  2
 (1 row)
 
+--
+-- Consts merging
+--
+SET pg_stat_statements.merge_threshold = 5;
+CREATE TABLE test_merge (id int, data int);
+-- IN queries
+-- Normal
+SELECT pg_stat_statements_reset();
+ pg_stat_statements_reset 
+--
+ 
+(1 row)
+
+SELECT * FROM test_merge WHERE id IN (1, 2, 3);
+ id | data 
++--
+(0 rows)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query  | calls 
++---
+ SELECT * FROM test_merge WHERE id IN ($1, $2, $3)  | 1
+ SELECT pg_stat_statements_reset()  | 1

Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-06-15 Thread David Christensen
On Tue, Jun 15, 2021 at 8:26 AM David Rowley  wrote:

> On Tue, 15 Jun 2021 at 21:24,  wrote:
> > Hmmm, I didn't think YB was necessary, but what do others think?
>
> For me personally, without consulting Wikipedia, I know that Petabyte
> comes after Terabyte and then I'm pretty sure it's Exabyte.  After
> that, I'd need to check.
>
> Assuming I'm not the only person who can't tell exactly how many bytes
> are in a Yottabyte, would it actually be a readability improvement if
> we started showing these units to people?
>

I hadn't really thought about that TBH; to me it seemed like an
improvement, but I do see that others might not, and adding confusion is
definitely not helpful.  That said, it seems like structuring the code so
that we can expand it by adding an element to a table, instead of the
existing nested if blocks, is still a useful refactor, whatever we decide
the cutoff units should be.


> I'd say there might be some argument to implement as far as PB one
> day, maybe not that far out into the future, especially if we got
> something like built-in clustering. But I just don't think there's any
> need to go all out and take it all the way to YB.  There's an above
> zero chance we'll break something of someones by doing this, so I
> think any changes here should be driven off an actual requirement.
>

I got motivated to do this due to some (granted synthetic) work/workloads,
where I was seeing 6+digit TB numbers and thought it was ugly.  Looked at
the code and thought the refactor was the way to go, and just stuck all of
the known units in.


> I really think this change is more likely to upset someone than please
> someone.
>

I'd be interested to see reactions from people; to me, it seems a +1, but
seems like -1, 0, +1 all valid opinions here; I'd expect more 0's and +1s,
but I'm probably biased since I wrote this. :-)
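
For illustration, a table-driven structure along those lines might look
roughly like this (a simplified sketch, not the actual patch; the rounding
here is approximate):

#include <stdio.h>
#include <stdint.h>

static const char *const size_units[] = {"bytes", "kB", "MB", "GB", "TB", "PB"};

/* Sketch: switch to the next unit once the value reaches 10240 of the
 * previous one, mirroring the existing behaviour; adding a unit is just
 * another array element. */
static void
size_pretty(int64_t size, char *buf, size_t buflen)
{
    int         idx = 0;
    int         nunits = (int) (sizeof(size_units) / sizeof(size_units[0]));

    while (idx < nunits - 1 && (size >= 10 * 1024 || size <= -10 * 1024))
    {
        size = (size + 512) / 1024;     /* simplified rounding */
        idx++;
    }
    snprintf(buf, buflen, "%lld %s", (long long) size, size_units[idx]);
}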


Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-06-15 Thread David Christensen
On Tue, Jun 15, 2021 at 8:31 AM Isaac Morland 
wrote:

> On Tue, 15 Jun 2021 at 05:24,  wrote:
>
>> >> I don't see the need to extend the unit to YB.
>> >> What use case do you have in mind?
>> >
>> >Practical or no, I saw no reason not to support all defined units. I
>> assume we’ll
>> >get to a need sooner or later. :)
>>
>> Thank you for your reply!
>> Hmmm, I didn't think YB was necessary, but what do others think?
>>
>
> If I’m reading the code correctly, the difference between supporting YB
> and not supporting it is whether there is a line for it in the list of
> prefixes and their multiples. As such, I don’t see why we’re even
> discussing whether or not to include all the standard prefixes. A YB is
> still an absurd amount of storage, but that’s not the point; just put all
> the standard prefixes and be done with it. If actual code changes were
> required in the new code as they are in the old it might be worth
> discussing.
>

Agreed, this is why I went this way.  One and done.


> One question: why is there no “k” in the list of prefixes?
>

kB has a special-case code block before you get to this point.  I didn't
look into the reasons, but assume there are some.


> Also: why not have only the prefixes in the array, and use a single fixed
> output format "%s %sB" all the time?
>
> It feels like it should be possible to calculate the appropriate idx to
> use (while adjusting the number to print as is done now) and then just have
> one psprintf call for all cases.
>

Sure, if that seems more readable/understandable.


> A more significant question is YB vs. YiB. I know there is a long
> tradition within computer-related fields of saying that k = 1024, M =
> 1024^2, etc., but we’re not special enough to override the more general
> principles of SI (Système International) which provide that k = 1000, M =
> 1000^2 and so on universally and provide the alternate prefixes ki, Mi,
> etc. which use 1024 as the multiple.
>
> So I would suggest either display 200 as 2MB or as 1.907MiB.
>

Heh, I was just expanding the existing logic; if others want to have this
particular battle go ahead and I'll adjust the code/prefixes, but obviously
the logic will need to change if we want to support true MB instead of MiB
as MB.

Also, this will presumably be a breaking change for anyone relying on the
existing units (MB == 1024 * 1024), as we've had them for something like 20
years.  Changing these units to the *iB forms will be trivial with this
patch, but I'm not looking forward to garnering the consensus to change
this part.

David


Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-06-15 Thread Amit Kapila
On Mon, Jun 14, 2021 at 9:08 PM Robert Haas  wrote:
>
> On Mon, Jun 14, 2021 at 2:32 AM Amit Kapila  wrote:
>
> > Yeah, this could be one idea but I think even if we use pg_proc OID,
> > we still need to check all the rel cache entries to find which one
> > contains the invalidated OID and that could be expensive. I wonder
> > can't we directly find the relation involved and register invalidation
> > for the same? We are able to find the relation to which trigger
> > function is associated during drop function via findDependentObjects
> > by scanning pg_depend. Assuming, we are able to find the relation for
> > trigger function by scanning pg_depend, what kinds of problems do we
> > envision in registering the invalidation for the same?
>
> I don't think that finding the relation involved and registering an
> invalidation for the same will work properly. Suppose there is a
> concurrently-running transaction which has created a new table and
> attached a trigger function to it. You can't see any of the catalog
> entries for that relation yet, so you can't invalidate it, but
> invalidation needs to happen. Even if you used some snapshot that can
> see those catalog entries before they are committed, I doubt it fixes
> the race condition. You can't hold any lock on that relation, because
> the creating transaction holds AccessExclusiveLock, but the whole
> invalidation mechanism is built around the assumption that the sender
> puts messages into the shared queue first and then releases locks,
> while the receiver first acquires a conflicting lock and then
> processes messages from the queue.
>

Won't such messages be processed at start-of-transaction time
(AtStart_Cache->AcceptInvalidationMessages)?

> Without locks, that synchronization
> algorithm can't work reliably. As a consequence of all that, I believe
> that, not just in this particular case but in general, the
> invalidation message needs to describe the thing that has actually
> changed, rather than any derived property. We can make invalidations
> that say "some function's parallel-safety flag has changed" or "this
> particular function's parallel-safety flag has changed" or "this
> particular function has changed in some way" (this one, we have
> already), but anything that involves trying to figure out what the
> consequences of such a change might be and saying "hey, you, please
> update XYZ because I changed something somewhere that could affect
> that" is not going to be correct.
>
> > I think we probably need to worry about the additional cost to find
> > dependent objects and if there are any race conditions in doing so as
> > pointed out by Tom in his email [1]. The concern related to cost could
> > be addressed by your idea of registering such an invalidation only
> > when the user changes the parallel safety of the function which we
> > don't expect to be a frequent operation. Now, here the race condition,
> > I could think of could be that by the time we change parallel-safety
> > (say making it unsafe) of a function, some of the other sessions might
> > have already started processing insert on a relation where that
> > function is associated via trigger or check constraint in which case
> > there could be a problem. I think to avoid that we need to acquire an
> > Exclusive lock on the relation as we are doing in Rename Policy kind
> > of operations.
>
> Well, the big issue here is that we don't actually lock functions
> while they are in use. So there's absolutely nothing that prevents a
> function from being altered in any arbitrary way, or even dropped,
> while code that uses it is running. I don't really know what happens
> in practice if you do that sort of thing: can you get the same query
> to run with one function definition for the first part of execution
> and some other definition for the rest of execution? I tend to doubt
> it, because I suspect we cache the function definition at some point.
>

It is possible that within the same statement execution a different
function definition gets executed. Say, in session-1 we are inserting
three rows; on the first-row insertion, definition-1 of a function in an
index expression gets executed. Now, from session-2, we change the
definition of the function to definition-2. Then, in session-1, on the
second-row insertion, while executing definition-1 of the function, we
insert into another table, which will accept the invalidation message
registered in session-2. Now, on the third-row insertion, the new
definition (definition-2) of the function will be executed.

> If that's the case, caching the parallel-safety marking at the same
> point seems OK too, or at least no worse than what we're doing
> already. But on the other hand if it is possible for a query's notion
> of the function definition to shift while the query is in flight, then
> this is just another example of that and no worse than any other.
> Instead of changing the parallel-safety flag, somebody could redefine
> the function so that it divides by zero or 

Re: Fix around conn_duration in pgbench

2021-06-15 Thread Yugo NAGATA
On Mon, 14 Jun 2021 10:57:07 +0200 (CEST)
Fabien COELHO  wrote:
 
> > However, I found that conn_duration is calculated even when -C/--connect
> > is not specified, which is wasteful. So we can remove this code, as done
> > in the attached patch.
> 
> I'm fine with the implied code simplification, but it deserves a comment.

Thank you for adding comments!
 
> > In addition, disconnection delays are not accumulated even under -C/--connect,
> > despite what the comment says. I also fixed this in the attached patch.
> > patch.
> 
> I'm fine with that, even if it only concerns is_connect. I've updated the 
> command to work whether now is initially set or not. 

Ok. I agree with your update. 
 
> Also, there is the issue of connection failures: the attached version adds
> an error message and exits consistently on initial connection failures.
> This is not done with is_connect, though, and I'm unsure what we should
> really do.

Well, as to connection failures, I think that we should discuss them in the
other thread [1] where this issue was originally raised, or in a new thread,
because this can be treated as a separate issue from the originally proposed
patch.

[1] 
https://www.postgresql.org/message-id/flat/TYCPR01MB5870057375ACA8A73099C649F5349%40TYCPR01MB5870.jpnprd01.prod.outlook.com.

Regards,
Yugo Nagata

-- 
Yugo NAGATA 




Re: Confused by the default privilege

2021-06-15 Thread 孙冰
Gee, I pasted the ending demonstration as html.

Re-pasting a text version.

--


┌
│ drop owned by owner;
│ drop role if exists owner, guest;
│
│ create role owner;
│ create role guest;
│
│ drop schema if exists s;
│ create schema if not exists s authorization owner;
└

DROP OWNED DROP ROLE CREATE ROLE CREATE ROLE DROP SCHEMA CREATE SCHEMA


1 tables


1.1 no-op


  ┌
  │ set role to owner;
  │ create or replace view s.v1 as select 1;
  └

  ┌
  │ \dp+ s.v1
  └

  
   Schema  Name  Type  Access privileges  Column privileges  Policies
  
   s   v1view
  

  ┌
  │ select * from information_schema.role_table_grants where table_name='v1';
  └


   grantor  grantee  table_catalog  table_schema  table_name  privilege_type  is_grantable  with_hierarchy
   owner    owner    postgres       s             v1          INSERT          YES           NO
   owner    owner    postgres       s             v1          SELECT          YES           YES
   owner    owner    postgres       s             v1          UPDATE          YES           NO
   owner    owner    postgres       s             v1          DELETE          YES           NO
   owner    owner    postgres       s             v1          TRUNCATE        YES           NO
   owner    owner    postgres       s             v1          REFERENCES      YES           NO
   owner    owner    postgres       s             v1          TRIGGER         YES           NO

  ┌
  │ set role to owner;
  │ select * from s.v1;
  └

  ━━
   ?column?
  ──
  1
  ━━


1.2 default privilege: `revoke all from owner'
───

  ┌
  │ alter default privileges for user owner revoke all on tables from owner;
  │ \ddp+
  └

  ━
   Owner  Schema  Type   Access privileges
  ─
   owner  table
  ━

  ┌
  │ set role to owner;
  │ create or replace view s.v2 as select 1;
  └

  ┌
  │ \dp+ s.v2
  └

  
   Schema  Name  Type  Access privileges  Column privileges  Policies
  
   s   v2view
  

  ┌
  │ select * from information_schema.role_table_grants where table_name='v2';
  └


   grantor  grantee  table_catalog  table_schema  table_name  privilege_type  is_grantable  with_hierarchy
   owner    owner    postgres       s             v2          INSERT          YES           NO
   owner    owner    postgres       s             v2          SELECT          YES           YES
   owner    owner    postgres       s             v2          UPDATE          YES           NO
   owner    owner    postgres       s             v2          DELETE          YES           NO
   owner    owner    postgres       s             v2          TRUNCATE        YES           NO
   owner    owner    postgres       s             v2          REFERENCES      YES           NO
   owner    owner    postgres       s             v2          TRIGGER         YES           NO

  ┌
  │ set role to owner;
  │ select * from s.v2;
  └

  ━━
   ?column?
  ──
  1
  ━━


1.3 default privilege: `revoke all but one from owner'
───

  ┌
  │ alter default privileges for user owner revoke all on tables from owner;
  │ alter default privileges for user owner grant trigger on tables to owner;
  │ \ddp+
  └

  ━
   Owner  Schema  Type   Access privileges
  ─
   owner  table  owner=t/owner
  ━

  ┌
  │ set role to owner;
  │ create or replace view s.v3 as select 1;
  └

  ┌
  │ \dp+ s.v3
  └

  
   Schema  Name  Type  Access privileges  Column privileges  Policies
  

Confused by the default privilege

2021-06-15 Thread 孙冰
Hi,

My use case is to create an isolated interface schema consisting of only
views and functions (possibly many schemas, for multi-tenancy or
multi-version), with minimal access exposure. To reduce the mental and
maintenance burden, I am inclined to create one role per interface schema,
instead of creating separate roles for the owner and the user. As a
consequence, the default privileges must be revoked from the owner.
Explicit revocation works just fine, except that it requires repetitive and
easy-to-forget statements for each object in the schema.

The default privileges come to the rescue. They mostly work, despite a bit
of confusion on my part.

The contents at the end are some experiments and demonstrations. To sum up,
I have to either leave some non-critical privileges (e.g., trigger,
references) in place via the default privilege mechanism or manually revoke
all privileges, to stop the owner from having all the default privileges.
Plus, the first alternative is not applicable to functions because there is
only one privilege for functions (execute).

To me, this is confusing and unintuitive. Or is there something I'm missing?

TL;DR
Revoking all default privileges is effectively equivalent to revoking
nothing, because an empty string of access privileges is handled as
'default'.

Maybe 'NULL' should mean 'default', and '' (empty string) should mean no privileges?

Regards.


--

drop owned by owner;
drop role if exists owner, guest;

create role owner;
create role guest;

drop schema if exists s;
create schema if not exists s authorization owner;

DROP OWNED DROP ROLE CREATE ROLE CREATE ROLE DROP SCHEMA CREATE SCHEMA
1. tables
1.1. no-op

set role to owner;
create or replace view s.v1 as select 1;

\dp+ s.v1

Schema Name Type Access privileges Column privileges Policies
s v1 view

select * from information_schema.role_table_grants where table_name='v1';

grantor grantee table_catalog table_schema table_name privilege_type
is_grantable with_hierarchy
owner owner postgres s v1 INSERT YES NO
owner owner postgres s v1 SELECT YES YES
owner owner postgres s v1 UPDATE YES NO
owner owner postgres s v1 DELETE YES NO
owner owner postgres s v1 TRUNCATE YES NO
owner owner postgres s v1 REFERENCES YES NO
owner owner postgres s v1 TRIGGER YES NO

set role to owner;
select * from s.v1;

?column?
1
1.2. default privilege: revoke all from owner

alter default privileges for user owner revoke all on tables from owner;
\ddp+

Owner Schema Type Access privileges
owner   table

set role to owner;
create or replace view s.v2 as select 1;

\dp+ s.v2

Schema Name Type Access privileges Column privileges Policies
s v2 view

select * from information_schema.role_table_grants where table_name='v2';

grantor grantee table_catalog table_schema table_name privilege_type
is_grantable with_hierarchy
owner owner postgres s v2 INSERT YES NO
owner owner postgres s v2 SELECT YES YES
owner owner postgres s v2 UPDATE YES NO
owner owner postgres s v2 DELETE YES NO
owner owner postgres s v2 TRUNCATE YES NO
owner owner postgres s v2 REFERENCES YES NO
owner owner postgres s v2 TRIGGER YES NO

set role to owner;
select * from s.v2;

?column?
1
1.3. default privilege: revoke all but one from owner

alter default privileges for user owner revoke all on tables from owner;
alter default privileges for user owner grant trigger on tables to owner;
\ddp+

Owner Schema Type Access privileges
owner   table owner=t/owner

set role to owner;
create or replace view s.v3 as select 1;

\dp+ s.v3

Schema Name Type Access privileges Column privileges Policies
s v3 view owner=t/owner

select * from information_schema.role_table_grants where table_name='v3';

grantor grantee table_catalog table_schema table_name privilege_type
is_grantable with_hierarchy
owner owner postgres s v3 TRIGGER YES NO

set role to owner;
select * from s.v3;

ERROR:  42501: permission denied for view v3
LOCATION:  aclcheck_error, aclchk.c:3461

1.4. manual revoke all from owner

alter default privileges for user owner revoke all on tables from owner;
\ddp+

Owner Schema Type Access privileges
owner   table

set role to owner;
create or replace view s.v4 as select 1;

\dp+ s.v4

Schema Name Type Access privileges Column privileges Policies
s v4 view

select * from information_schema.role_table_grants where table_name='v4';

grantor grantee table_catalog table_schema table_name privilege_type
is_grantable with_hierarchy
owner owner postgres s v4 INSERT YES NO
owner owner postgres s v4 SELECT YES YES
owner owner postgres s v4 UPDATE YES NO
owner owner postgres s v4 DELETE YES NO
owner owner postgres s v4 TRUNCATE YES NO
owner owner postgres s v4 REFERENCES YES NO
owner owner postgres s v4 TRIGGER YES NO

set role to owner;
select * from s.v4;

?column?
1

So far, the situation is identical to s.v2.

set role to owner;
revoke all on table s.v4 from owner;

\dp+ s.v4

Schema Name Type Access privileges Column privileges Policies
s v4 view

select 

Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-06-15 Thread Julien Rouhaud
On Tue, Jun 15, 2021 at 9:31 PM Andrew Dunstan  wrote:
>
> Rather than use size, I'd be inclined to say use this if the source
> database is marked as a template, and use the copydir approach for
> anything that isn't.

Looks like a good approach.




Re: [bug?] Missed parallel safety checks, and wrong parallel safety

2021-06-15 Thread Robert Haas
On Tue, Jun 15, 2021 at 7:05 AM Amit Kapila  wrote:
> Okay, but I think if we go with your suggested model where whenever
> there is a change in parallel-safety of any function, we need to send
> the new invalidation then I think it won't matter whether the function
> is associated with index expression, check constraint in the table, or
> is used in any other way.

No, it will still matter, because I'm proposing that the
parallel-safety of functions that we only access via operator classes
will not even be checked. Also, if we decided to make the system more
fine-grained - e.g. by invalidating on the specific OID of the
function that was changed rather than doing something that is
database-wide or global - then it matters even more.

> Yeah, dealing with partitioned tables is tricky. I think if we don't
> want to check upfront the parallel safety of all the partitions then
> the other option as discussed could be to ask the user to specify the
> parallel safety of partitioned tables.

Just to be clear here, I don't think it really matters what we *want*
to do. I don't think it's reasonably *possible* to check all the
partitions, because we don't hold locks on them. When we're assessing
a bunch of stuff related to an individual relation, we have a lock on
it. I think - though we should double-check tablecmds.c - that this is
enough to prevent all of the dependent objects - triggers,
constraints, etc. - from changing. So the stuff we care about is
stable. But the situation with a partitioned table is different. In
that case, we can't even examine that stuff without locking all the
partitions. And even if we do lock all the partitions, the stuff could
change immediately afterward and we wouldn't know. So I think it would
be difficult to make it correct.

Now, maybe it could be done, and I think that's worth a little more
thought. For example, perhaps whenever we invalidate a relation, we
could also somehow send some new, special kind of invalidation for its
parent saying, essentially, "hey, one of your children has changed in
a way you might care about." But that's not good enough, because it
only goes up one level. The grandparent would still be unaware that a
change it potentially cares about has occurred someplace down in the
partitioning hierarchy. That seems hard to patch up, again because of
the locking rules. The child can know the OID of its parent without
locking the parent, but it can't know the OID of its grandparent
without locking its parent. Walking up the whole partitioning
hierarchy might be an issue for a number of reasons, including
possible deadlocks, and possible race conditions where we don't emit
all of the right invalidations in the face of concurrent changes. So I
don't quite see a way around this part of the problem, but I may well
be missing something. In fact I hope I am missing something, because
solving this problem would be really nice.

> We can additionally check the
> parallel safety of partitions when we are trying to insert into a
> particular partition and error out if we detect any parallel-unsafe
> clause and we are in parallel-mode. So, in this case, we won't be
> completely relying on the users. Users can either change the parallel
> safe option of the table or remove/change the parallel-unsafe clause
> after error. The new invalidation message as we are discussing would
> invalidate the parallel-safety for individual partitions but not the
> root partition (partitioned table itself). For root partition, we will
> rely on information specified by the user.

Yeah, that may be the best we can do. Just to be clear, I think we
would want to check whether the relation is still parallel-safe at the
start of the operation, but not have a run-time check at each function
call.
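
To give a concrete flavour of what such a start-of-operation test has to look
at, here is a rough, illustrative catalog query for just one ingredient of it:
functions referenced by a relation's check constraints that are not marked
parallel safe.  (The real check would be done in C on the expression trees and
would also have to cover column defaults, index expressions, triggers and so
on; the partition name is made up.)

SELECT c.conname, p.proname, p.proparallel
FROM pg_constraint c
JOIN pg_depend d
  ON d.classid = 'pg_constraint'::regclass
 AND d.objid = c.oid
 AND d.refclassid = 'pg_proc'::regclass
JOIN pg_proc p ON p.oid = d.refobjid
WHERE c.conrelid = 'some_partition'::regclass   -- hypothetical partition
  AND p.proparallel <> 's';                     -- 's' = parallel safe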

> I am not sure if we have a simple way to check the parallel safety of
> partitioned tables. In some way, we need to rely on user either (a) by
> providing an option to specify whether parallel Inserts (and/or other
> DMLs) can be performed, or (b) by providing a guc and/or rel option
> which indicate that we can check the parallel-safety of all the
> partitions. Yet another option that I don't like could be to
> parallelize inserts on non-partitioned tables.

If we figure out a way to check the partitions automatically that
actually works, we don't need a switch for it; we can (and should)
just do it that way all the time. But if we can't come up with a
correct algorithm for that, then we'll need to add some kind of option
where the user declares whether it's OK.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




SSL/TLS instead of SSL in docs

2021-06-15 Thread Daniel Gustafsson
In the NSS thread it was discussed (20210603210642.gf22...@momjian.us etc) that
we use SSL rather than TLS in the documentation, which is technically somewhat
incorrect.  Consensus came to using SSL/TLS instead for referring to encrypted
connections.  Since this isn't really limited to the NSS work, I'm breaking
this out into a new thread.

Looking at the docs it turns out that we have a mix of SSL (with one ssl),
SSL/TLS and TLS for referring to the same thing.  The attached changes the
documentation to consistently use SSL/TLS when referring to an encrypted
connection using a TLS protocol, leaving bare SSL and TLS only for referring to
the actual protocols.  I *think* I found all instances, there are many so I
might have missed some, but this version seemed like a good place to continue
the discussion from the previous thread.

Admittedly it gets pretty unwieldy with the <acronym> markup on SSL and TLS
but I opted for being consistent, since I don't know of any rules for when it
can/should be omitted (and it seems quite arbitrary right now).  Mentions in
titles were previously not marked up so I've left those as is.  I've also left
line breaks as an exercise for later to make the diff more readable.

While in there I added IMO missing items to the glossary and acronyms sections
as well as fixed up markup around OpenSSL.

This only deals with docs, but if this is deemed interesting then user-facing
messages in the code should use SSL/TLS as well of course.

Thoughts?

--
Daniel Gustafsson   https://vmware.com/



v1-0003-docs-Consistent-OpenSSL-markup.patch
Description: Binary data


v1-0002-docs-Replace-usage-of-SSL-with-SSL-TLS.patch
Description: Binary data


v1-0001-docs-SSL-TLS-related-acronyms-and-glossary.patch
Description: Binary data


Re: unnesting multirange data types

2021-06-15 Thread Tom Lane
Alexander Korotkov  writes:
> Pushed!  Thanks to thread participants for raising this topic and
> review.  I'll be around to resolve issues if any.

Buildfarm is pretty thoroughly unhappy.  Did you do a "check-world"
before pushing?

regards, tom lane




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-06-15 Thread Andrew Dunstan


On 6/15/21 8:04 AM, Heikki Linnakangas wrote:
>
> Yeah, WAL-logging the contents of the source database would certainly
> be less weird than the current system. As Julien also pointed out, the
> question is, are there people relying on "CREATE DATABASE foo TEMPLATE
> bar" to copy a large source database, on the premise that it's fast
> because it skips WAL-logging?


I'm 100% certain there are. It's not even a niche case.


>
> In principle, we could have both mechanisms, and use the new
> WAL-logged system if the database is small, and the old system with
> checkpoints if it's large. But I don't like idea of having to maintain
> both.
>
>

Rather than use size, I'd be inclined to say use this if the source
database is marked as a template, and use the copydir approach for
anything that isn't.
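
(Presumably "marked as a template" here means pg_database.datistemplate, i.e.
something along the lines of:

SELECT datname, datistemplate FROM pg_database;

so the WAL-logged path would be taken only for databases with datistemplate
set, and copydir for everything else.)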


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-06-15 Thread Isaac Morland
On Tue, 15 Jun 2021 at 05:24,  wrote:

> >> I don't see the need to extend the unit to YB.
> >> What use case do you have in mind?
> >
> >Practical or no, I saw no reason not to support all defined units. I
> assume we’ll
> >get to a need sooner or later. :)
>
> Thank you for your reply!
> Hmmm, I didn't think YB was necessary, but what do others think?
>

If I’m reading the code correctly, the difference between supporting YB and
not supporting it is whether there is a line for it in the list of prefixes
and their multiples. As such, I don’t see why we’re even discussing whether
or not to include all the standard prefixes. A YB is still an absurd amount
of storage, but that’s not the point; just put all the standard prefixes
and be done with it. If actual code changes were required in the new code
as they are in the old it might be worth discussing.

One question: why is there no “k” in the list of prefixes?

Also: why not have only the prefixes in the array, and use a single fixed
output format "%s %sB" all the time?

It feels like it should be possible to calculate the appropriate idx to use
(while adjusting the number to print as is done now) and then just have one
psprintf call for all cases.
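
Just to show the shape of the idea (the real thing would of course stay in C
in dbsize.c; this SQL is only an illustration with made-up rounding):

WITH s(bytes) AS (VALUES (2000000::numeric)),
     prefixes(idx, prefix) AS (
       VALUES (0,''),(1,'k'),(2,'M'),(3,'G'),(4,'T'),
              (5,'P'),(6,'E'),(7,'Z'),(8,'Y'))
SELECT format('%s %sB', round(bytes / 1024.0 ^ idx, 1), prefix)
FROM s, prefixes
WHERE bytes / 1024.0 ^ idx < 1024 OR idx = 8
ORDER BY idx
LIMIT 1;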

A more significant question is YB vs. YiB. I know there is a long tradition
within computer-related fields of saying that k = 1024, M = 1024^2, etc.,
but we’re not special enough to override the more general principles of SI
(Système International) which provide that k = 1000, M = 1000^2 and so on
universally and provide the alternate prefixes ki, Mi, etc. which use 1024
as the multiple.

So I would suggest either displaying 2000000 as 2MB or as 1.907MiB.
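
For what it's worth, the current function already divides by powers of 1024
while labelling the result with SI-looking names, which is easy to see from
psql (the results below are what I'd expect from a stock install):

SELECT pg_size_pretty(2000000::bigint);   -- 1953 kB (1024-based), not 2000 kB
SELECT pg_size_pretty(2000000::numeric);  -- same for the numeric variant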


Re: [PATCH] expand the units that pg_size_pretty supports on output

2021-06-15 Thread David Rowley
On Tue, 15 Jun 2021 at 21:24,  wrote:
> Hmmm, I didn't think YB was necessary, but what do others think?

For me personally, without consulting Wikipedia, I know that Petabyte
comes after Terabyte and then I'm pretty sure it's Exabyte.  After
that, I'd need to check.

Assuming I'm not the only person who can't tell exactly how many bytes
are in a Yottabyte, would it actually be a readability improvement if
we started showing these units to people?

I'd say there might be some argument to implement as far as PB one
day, maybe not that far out into the future, especially if we got
something like built-in clustering. But I just don't think there's any
need to go all out and take it all the way to YB.  There's an above
zero chance we'll break something of someone's by doing this, so I
think any changes here should be driven off an actual requirement.

I really think this change is more likely to upset someone than please someone.

Just my thoughts.

David




Re: Case expression pushdown

2021-06-15 Thread Ashutosh Bapat
Looks quite useful to me. Can you please add this to the next commitfest?
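
Presumably the same deparsing also makes quals containing CASE shippable for
plain foreign scans, e.g. something of this shape (hypothetical query against
the ft2 table from the example below; I haven't checked the resulting plan):

EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM ft2
WHERE (CASE WHEN c2 > 0 THEN c2 ELSE 0 END) > 50;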

On Wed, Jun 9, 2021 at 5:25 PM Alexander Pyhalov
 wrote:
>
> Hi.
>
> This patch allows pushing case expressions to foreign servers, so that
> more types of updates could be executed directly.
>
> For example, without patch:
>
> EXPLAIN (VERBOSE, COSTS OFF)
> UPDATE ft2 d SET c2 = CASE WHEN c2 > 0 THEN c2 ELSE 0 END
> WHERE c1 > 1000;
>QUERY PLAN
> ---
>Update on public.ft2 d
> Remote SQL: UPDATE "S 1"."T 1" SET c2 = $2 WHERE ctid = $1
> ->  Foreign Scan on public.ft2 d
>   Output: CASE WHEN (c2 > 0) THEN c2 ELSE 0 END, ctid, d.*
>   Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8, ctid FROM
> "S 1"."T 1" WHERE (("C 1" > 1000)) FOR UPDATE
>
> with patch:
> EXPLAIN (VERBOSE, COSTS OFF)
> UPDATE ft2 d SET c2 = CASE WHEN c2 > 0 THEN c2 ELSE 0 END
> WHERE c1 > 1000;
> QUERY PLAN
> 
>   Update on public.ft2 d
> ->  Foreign Update on public.ft2 d
>   Remote SQL: UPDATE "S 1"."T 1" SET c2 = (CASE WHEN (c2 > 0)
> THEN c2 ELSE 0 END) WHERE (("C 1" > 1000))
>
>
> --
> Best regards,
> Alexander Pyhalov,
> Postgres Professional



-- 
Best Wishes,
Ashutosh Bapat




Re: [Proposal] Fully WAL logged CREATE DATABASE - No Checkpoints

2021-06-15 Thread Adam Brusselback
Am I mistaken in thinking that this would allow CREATE DATABASE to run
inside a transaction block now, further reducing the DDL commands that are
non-transactional?
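
For reference, this is the restriction in question (database names made up;
the error text is what current releases emit):

BEGIN;
CREATE DATABASE clonedb TEMPLATE mydb;
-- ERROR:  CREATE DATABASE cannot run inside a transaction block
ROLLBACK;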


Re: Function scan FDW pushdown

2021-06-15 Thread Ashutosh Bapat
Hi Alexander,

On Thu, May 20, 2021 at 11:13 PM Alexander Pyhalov
 wrote:
>
> Hi.
>
> The attached patch allows pushing joins with function RTEs to PostgreSQL
> data sources.
> This makes executing queries like this
>
> create foreign table f_pgbench_accounts (aid int, bid int, abalance int,
> filler char(84)) SERVER local_srv OPTIONS (table_name
> 'pgbench_accounts');
> select * from f_pgbench_accounts join unnest(array[1,2,3]) ON unnest =
> aid;
>

It would be good to provide some practical examples of where this is useful.



> more efficient.
>
> with patch:
>
>
> So far I don't know how to visualize actual function expression used in
> function RTE, as in postgresExplainForeignScan() es->rtable comes from
> queryDesc->plannedstmt->rtable, and rte->functions is already 0.

The actual function expression will be part of the Remote SQL of
ForeignScan node so no need to visualize it separately.

The patch will have problems when there are multiple foreign tables,
all on different servers or using different FDWs. In such a case the
function scan's RelOptInfo will get the fpinfo based on the first
foreign table the function scan is paired with during join planning.
But that may not be the best foreign table to join. We should be able
to plan all the possible joins. Current infra to add one fpinfo per
RelOptInfo won't help there. We need something better.

The patch targets only postgres_fdw; how do you see this working with
other FDWs?

If we come up with the right approach we could use it for 1. pushing
down queries with IN () clause 2. joining a small local table with a
large foreign table by sending the local table rows down to the
foreign server.
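
For instance, with a small local table "ids" (made up) and the
f_pgbench_accounts foreign table from the quoted message, both of these are,
if I'm reading the proposal right, the kind of statement that could run
entirely on the remote side once the local rows can be shipped across:

-- IN () over a local row set: today the semi-join happens locally
SELECT *
FROM f_pgbench_accounts
WHERE aid IN (SELECT aid FROM ids);

-- small local table joined to a large foreign table: the rows of "ids"
-- could be sent over, e.g. as unnest() of an array, and joined remotely
SELECT a.*
FROM ids
JOIN f_pgbench_accounts a USING (aid);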

-- 
Best Wishes,
Ashutosh Bapat




Re: doc issue missing type name "multirange" in chapter title

2021-06-15 Thread Pavel Stehule
On Tue, 15 Jun 2021 at 15:11, Alexander Korotkov  wrote:

> On Sun, Jun 13, 2021 at 2:48 PM Alexander Korotkov 
> wrote:
> > On Fri, Jun 11, 2021 at 11:16 PM Tom Lane  wrote:
> > >
> > > Alexander Korotkov  writes:
> > > > What about "Range/Multirange Functions and Operators"?
> > >
> > > Better than a comma, I guess.  Personally I didn't have a
> > > problem with the form with two "ands".
> >
> > Thank you.  I propose to push the slash option because it both evades
> > double "and" and it's aligned with sibling section headers (we have
> > "Date/Time Functions and Operators").
> >
> > Any objections?
>
> I heard no objection.  So, pushed.
>

Thank you

Pavel


> --
> Regards,
> Alexander Korotkov
>


Re: doc issue missing type name "multirange" in chapter title

2021-06-15 Thread Alexander Korotkov
On Sun, Jun 13, 2021 at 2:48 PM Alexander Korotkov  wrote:
> On Fri, Jun 11, 2021 at 11:16 PM Tom Lane  wrote:
> >
> > Alexander Korotkov  writes:
> > > What about "Range/Multirange Functions and Operators"?
> >
> > Better than a comma, I guess.  Personally I didn't have a
> > problem with the form with two "ands".
>
> Thank you.  I propose to push the slash option because it both evades
> double "and" and it's aligned with sibling section headers (we have
> "Date/Time Functions and Operators").
>
> Any objections?

I heard no objection.  So, pushed.

--
Regards,
Alexander Korotkov



