Re: Issue with point_ops and NaN

2021-03-31 Thread Kyotaro Horiguchi
At Wed, 31 Mar 2021 15:46:16 +0900 (JST), Kyotaro Horiguchi 
 wrote in 
> At Wed, 31 Mar 2021 09:26:00 +0900, Michael Paquier  
> wrote in 
> > On Tue, Mar 30, 2021 at 11:39:40PM +0800, Julien Rouhaud wrote:
> > > On Tue, Mar 30, 2021 at 11:02:32AM -0400, Tom Lane wrote:
> > >> Agreed --- one could make an argument for either 'false' or NULL
> > >> result, but surely not 'true'.
> > > 
> > > I would think that it should return NULL since it's not inside nor 
> > > outside the
> > > polygon, but I'm fine with false.
> > 
> > Yeah, this is trying to make an undefined point fit into a box that
> > has a  definition, so "false" does not make sense to me either here as
> > it implies that the point exists?  NULL seems adapted here.
> 
> Sounds reasonable.  The function may return NULL for other cases so
> it's easily changed to NULL.
> 
> # But it's bothersome to cover all parallels..

Hmm. Many internal functions return bool, which cannot represent the
NaN case naturally.  In short, it's more invasive than expected.

> Does anyone oppose making this case return NULL?  If no one objects, I'll do
> that.

Mmm. I'd like to reduce my vote from +1 to +0.7 or so, considering the
amount of work needed...

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Proposal: Save user's original authenticated identity for logging

2021-03-31 Thread Michael Paquier
On Tue, Mar 30, 2021 at 11:15:48PM +, Jacob Champion wrote:
> Rather than putting Postgres log data into the Perl logs, I rotate the
> logs exactly once at the beginning -- so that there's an
> old 001_ssltests_primary.log, and a new 001_ssltests_primary_1.log --
> and then every time we truncate the logfile, I shuffle the bits from
> the new logfile into the old one. So no one has to learn to find the
> log entries in a new place, we don't get an explosion of rotated logs,
> we don't lose the log data, we don't match incorrect portions of the
> logs, and we only pay the restart price once. This is wrapped into a
> small Perl module, LogCollector.

Hmm.  I have dug today into that and I am really not convinced that
this is necessary, as a connection attempt combined with the output
sent to stderr gives you the stability needed.  If we were to have
anything like that, perhaps a sub-class of PostgresNode would be
adapted instead, with an internal log integration.

After thinking about it, the new wording in config.sgml looks fine
as-is.

Anyway, I have not been able to convince myself that we need those
slowdowns and that many server restarts as there is no
reload-dependent timing here, and things have been stable on
everything I have tested (including a slow RPI).  I have found a
couple of things that can be simplified in the tests:
- In src/test/authentication/, except for the trust method where there
is no auth ID, all the other tests wrap a like() if $res == 0, or
unlike() otherwise.  I think that it is cleaner to make the matching
pattern an argument of test_role(), and adapt the tests to that.
- src/test/ldap/ can also embed a single logic within test_access().
- src/test/ssl/ is a different beast, but I think that there is more
refactoring possible here in parallel of the recent work I have sent
to have equivalents of test_connect_ok() and test_connect_fails() in
PostgresNode.pm.  For now, I think that we should just live in this
set with a small routine able to check for pattern matches in the
logs.

Attached is an updated patch, with a couple of comments tweaks, the
reworked tests and an indentation done.
--
Michael
From 0f308c892cb4bb120572e996b1eb612682134c03 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 31 Mar 2021 16:05:33 +0900
Subject: [PATCH v15] Log authenticated identity from all auth backends

The "authenticated identity" is the string used by an auth method to
identify a particular user. In many common cases this is the same as the
Postgres username, but for some third-party auth methods, the identifier
in use may be shortened or otherwise translated (e.g. through pg_ident
mappings) before the server stores it.

To help DBAs see who has actually interacted with the system, store the
original identity when authentication succeeds, and expose it via the
log_connections setting. The log entries look something like this
example (where a local user named "pchampion" is connecting to the
database as the "admin" user):

LOG:  connection received: host=[local]
LOG:  connection authenticated: identity="pchampion" method=peer (/data/pg_hba.conf:88)
LOG:  connection authorized: user=admin database=postgres application_name=psql

port->authn_id is set according to the auth method:

bsd: the Postgres username (which is the local username)
cert: the client's Subject DN
gss: the user principal
ident: the remote username
ldap: the final bind DN
pam: the Postgres username (which is the PAM username)
password (and all pw-challenge methods): the Postgres username
peer: the peer's pw_name
radius: the Postgres username (which is the RADIUS username)
sspi: either the down-level (SAM-compatible) logon name, if
  compat_realm=1, or the User Principal Name if compat_realm=0

The trust auth method does not set an authenticated identity. Neither
does using clientcert=verify-full (use the cert auth method instead).

The Kerberos test suite has been modified to let tests match multiple
log lines.
---
 src/include/libpq/hba.h   |   1 +
 src/include/libpq/libpq-be.h  |  13 +++
 src/backend/libpq/auth.c  | 136 --
 src/backend/libpq/hba.c   |  24 
 src/test/authentication/t/001_password.pl |  76 +---
 src/test/kerberos/t/001_auth.pl   |  47 ++--
 src/test/ldap/t/001_auth.pl   |  42 +--
 src/test/ssl/t/001_ssltests.pl|  61 +-
 src/test/ssl/t/002_scram.pl   |  18 ++-
 doc/src/sgml/config.sgml  |   3 +-
 10 files changed, 381 insertions(+), 40 deletions(-)

diff --git a/src/include/libpq/hba.h b/src/include/libpq/hba.h
index 1ec8603da7..63f2962139 100644
--- a/src/include/libpq/hba.h
+++ b/src/include/libpq/hba.h
@@ -137,6 +137,7 @@ typedef struct Port hbaPort;
 
 extern bool load_hba(void);
 extern bool load_ident(void);
+extern const char *hba_authname(hbaPort *port);
 extern void 

Re: pgbench - add pseudo-random permutation function

2021-03-31 Thread Fabien COELHO



Hello Dean,


>> First, I have a thing against erand48.
>
> Yeah, that's probably a fair point. However, all the existing pgbench
> random functions are using it, so I think it's fair enough for permute()
> to do the same (and actually 2^48 is pretty huge). Switching to a 64-bit
> PRNG might not be a bad idea, but I think that's something we'd want to
> do across the board, and so I think it should be out of scope for this
> patch.

But less likely to pass, whereas here we have an internal function that
we can set as we want.

Also, there is a 64 bits seed provided to the function which instantly
ignores 16 of them, which looks pretty silly to me.

Also, the function is named everywhere erand48 with its hardcoded int16[3]
state, which makes a poor abstraction.

At least, I suggest that two 48-bits prng could be initialized with parts
of the seed and used in different places, eg for r & m.

Also, the seed could be used to adjust the rotation, maybe.

>> I'm really at odds with FULL SHIFT 1, because it means that up to 1/256 of
>> values are kept out of STEERING. [...]
>
> Ah, that's a good point. Something else that also concerned me there was
> that it might lead to 2 consecutive full shifts with nothing in between,
> which would lead to less uniform randomness (like the Irwin-Hall
> distribution). I just did a quick test without the first full shift, and
> the results do appear to be better,

Indeed, it makes sense to me.

> so removing that looks like a good idea.

>> Third, I think that the rotate code can be simplified, in particular
>> the ?: should be avoided because it may induce branches quite damaging
>> to processor performance.
>
> Yeah, I wondered about that. Perhaps there's a "trick" that can be
> used to simplify it. Pre-computing the number of bits in the mask
> would probably help.

See pg_popcount64().

--
Fabien.




Re: Issue with point_ops and NaN

2021-03-31 Thread Julien Rouhaud
On Wed, Mar 31, 2021 at 03:48:16PM +0900, Kyotaro Horiguchi wrote:
> 
> Thanks! However, Michael's suggestion is worth considering.  What do
> you think about making NaN-involved comparisons return NULL?  If you
> agree to that, I'll make a further change to the patch.

As I mentioned in [1] I think that returning NULL would be the right thing to do.
But you mentioned elsewhere that it would need a lot more work to make the code
work that way, so given that we're 7 days away from the feature freeze maybe
returning false would be a better option.  One important thing to consider is
that we should consistently return NULL for similar cases, and having some
discrepancy there would be way worse than returning false everywhere.

[1] https://www.postgresql.org/message-id/20210330153940.vmncwnmuw3qnpkfa@nol




Re: locking [user] catalog tables vs 2pc vs logical rep

2021-03-31 Thread Ajin Cherian
On Tue, Mar 16, 2021 at 1:36 AM vignesh C  wrote:

> On Tue, Feb 23, 2021 at 3:59 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > The 2pc decoding added in
> >
> > commit a271a1b50e9bec07e2ef3a05e38e7285113e4ce6
> > Author: Amit Kapila 
> > Date:   2021-01-04 08:34:50 +0530
> >
> > Allow decoding at prepare time in ReorderBuffer.
> >
> > has a deadlock danger when used in a way that takes advantage of
> > separate decoding of the 2PC PREPARE.
> >
> >
> > I assume the goal of decoding the 2PC PREPARE is so one can wait for the
> > PREPARE to have logically replicated, before doing the COMMIT PREPARED.
> >
> >
> > However, currently it's pretty easy to get into a state where logical
> > decoding cannot progress until the 2PC transaction has
> > committed/aborted. Which essentially would lead to undetected deadlocks.
> >
> > The problem is that catalog tables accessed during logical decoding need
> > to get locked (otherwise e.g. a table rewrite could happen
> > concurrently). But if the prepared transaction itself holds a lock on a
> > catalog table, logical decoding will block on that lock - which won't be
> > released until replication progresses. A deadlock.
> >
> > A trivial example:
> >
> > SELECT pg_create_logical_replication_slot('test', 'test_decoding');
> > CREATE TABLE foo(id serial primary key);
> > BEGIN;
> > LOCK pg_class;
> > INSERT INTO foo DEFAULT VALUES;
> > PREPARE TRANSACTION 'foo';
> >
> > -- hangs waiting for pg_class to be unlocked
> > SELECT pg_logical_slot_get_changes('test', NULL, NULL,
> 'two-phase-commit', '1');
> >
> >
> > Now, more realistic versions of this scenario would probably lock a
> > 'user catalog table' containing replication metadata instead of
> > pg_class, but ...
> >
> >
> > At first this seems to be a significant issue. But on the other hand, if
> > you were to shut the cluster down in this situation (or disconnect all
> > sessions), you have broken cluster on your hand - without logical
> > decoding being involved. As it turns out, we need to read pg_class to
> > log in...  And I can't remember this being reported to be a problem?
> >
> >
> > Perhaps all that we need to do is to disallow 2PC prepare if [user]
> > catalog tables have been locked exclusively? Similar to how we're
> > disallowing preparing tables with temp table access.
> >
>
> Even I felt we should not allow preparing a transaction that has locked
> system tables, as it prevents creating a new session after restart and
> also causes a deadlock during logical decoding of the prepared
> transaction.
> I have made a patch to make the prepare transaction fail in this
> scenario. Attached is the patch for the same.
> Thoughts?
>
>
The patch applies fine on HEAD and "make check" passes fine. No major
comments on the patch, just a minor comment:

If you could change the error from "cannot PREPARE a transaction that has
a lock on user catalog/system table(s)"
to "cannot PREPARE a transaction that has an *exclusive lock* on user
catalog/system table(s)", that would be a more
accurate instruction to the user.

regards,
Ajin Cherian
Fujitsu Australia


Re: [PATCH] add concurrent_abort callback for output plugin

2021-03-31 Thread Ajin Cherian
On Wed, Mar 31, 2021 at 5:25 PM Markus Wanner <
markus.wan...@enterprisedb.com> wrote:

>
> The last sentences there now seems to relate to just the setting of
> "concurrent_abort", rather than the whole reason to invoke the
> prepare_cb.  And the reference to the "gid" is a bit lost.  Maybe:
>
> "Thus even in case of a concurrent abort, enough information is
>  provided to the output plugin for it to properly deal with the
>  ROLLBACK PREPARED once that is decoded."
>
> Alternatively, state that the gid is otherwise missing earlier in the
> docs (similar to how the commit message describes it).
>

I'm fine with Amit's changes and like Markus's last suggestion as well.

regards,
Ajin Cherian
Fujitsu Australia


Re: Idea: Avoid JOINs by using path expressions to follow FKs

2021-03-31 Thread Joel Jacobson
On Wed, Mar 31, 2021, at 08:18, Julien Rouhaud wrote:
> On Wed, Mar 31, 2021 at 12:50:19AM +0200, Joel Jacobson wrote:
> > On Tue, Mar 30, 2021, at 22:01, Isaac Morland wrote:
> > > On Tue, 30 Mar 2021 at 15:33, Joel Jacobson  > > > wrote:
> > >>> Also, should the join be a left join, which would therefore return a 
> > >>> NULL when there is no matching record? Or could we have a variation 
> > >>> such as ->? to give a left join (NULL when no matching record) with -> 
> > >>> using an inner join (record is not included in result when no matching 
> > >>> record).
> > >> 
> > >> Interesting idea, but I think we can keep it simple, and still support 
> > >> the case you mention:
> > >> 
> > >> If we only have -> and you want to exclude records where the column is 
> > >> NULL (i.e. INNER JOIN),
> > >> I think we should just use the WHERE clause and filter on such condition.
> > >> 
> > > 
> > > Just to be clear, it will always be a left join? Agreed that getting the 
> > > inner join behaviour can be done in the WHERE clause. I think this is a 
> > > case where simple is good. As long as the left join case is supported I'm 
> > > happy.
> > 
> > Hmm, I guess, since technically, if all foreign key column(s) are declared 
> > as NOT NULL, we would know for sure such values exist, so a LEFT JOIN and 
> > INNER JOIN would always produce the same result.
> > I'm not sure if the query planner could produce different plans though, and 
> > if an INNER JOIN could be more efficient. If it matters, then I think we 
> > should generate an INNER JOIN for the "all column(s) NOT NULL" case.
> 
> I'm not sure who is supposed to be the target for this proposal.
> 
> As far as I understand this won't change the fact that users will still have 
> to
> understand the "relational" part of RDBMS, understand what is a JOIN
> cardinality and everything that comes with it.  So you think that people who
> are too lazy to learn the proper JOIN syntax will still bother to learn about
> relational algebra and understand what they're doing, and I'm very doubtful
> about that.
> 
> You also think that writing a proper JOIN is complex, but somehow writing a
> proper WHERE clause to subtly change the query behavior is not a problem, or
> that if users want to use aggregate or anything more complex then they'll
> happily open the documentation and learn how to do that.  In my experience 
> what
> will happen is that instead users will keep using that limited subset of SQL
> features and build creative and incredibly inefficient systems to avoid using
> anything else and will then complain that postgres is too slow.

Thanks for the interesting new insights and questions.

Traditional SQL JOINs reveal less information about the data model,
compared to this new proposed foreign key based syntax.

Traditional SQL JOINs => undirected graph can be inferred
Foreign key joins => directed graph can be inferred

When looking at a traditional join, you might be able to guess the direction,
based on the names of tables and columns, but you cannot know for sure without
looking at the table definitions.

I'm thinking the target is both expert and beginner users,
who prefer a more concise syntax and reduced cognitive load:
 
Imagine a company with two types of SQL users:
1) Tech core team, responsible for schema changes (DDL), such as adding new 
tables/columns
and adding proper foreign keys.
2) Normal users, responsible for writing SQL queries using the existing schema.

In such a scenario, (2) would use the foreign keys added by (1),
letting them focus on *what* to join and less on *how* to join,
all in line with the objectives of the declarative paradigm.

By using the foreign keys, it is guaranteed you cannot get an
accidental one-to-many join that would multiply the result set.

How many rows a big query with lots of joins returns can be difficult
to reason about; you need to carefully inspect each table to understand
which column(s) have unique constraints on them, and therefore which
joins cannot multiply the result set.

If using the -> notation, you would only need to manually
inspect the tables involved in the remaining JOINs,
since you could be confident that no use of -> can affect cardinality.

I think this would be a win also for an expert SQL consultant working
with a new complex data model never seen before.
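
To make the idea concrete, here is a hypothetical sketch (table, column and
constraint names are invented, and the -> syntax is only the proposal being
discussed, not something PostgreSQL supports today):

-- Traditional joins: the reader must check the foreign keys and unique
-- constraints to convince themselves that none of these joins multiplies rows.
SELECT ol.quantity, p.name AS product, c.name AS customer
FROM order_lines ol
JOIN products p ON p.product_id = ol.product_id
JOIN orders o ON o.order_id = ol.order_id
JOIN customers c ON c.customer_id = o.customer_id;

-- Proposed notation: each -> follows a declared foreign key, so by
-- construction none of these lookups can change the cardinality.
SELECT ol.quantity,
       ol.product_id->name AS product,
       ol.order_id->customer_id->name AS customer
FROM order_lines ol;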

> 
> As an example just yesterday some user complained that it's not possible to
> write a trigger on a table that could intercept inserting a textual value on 
> an
> integer field and replace it with the referenced value.  And he rejected our
> suggested solution to replace the "INSERT INTO sometable VALUES..." with
> "INSERT INTO sometable SELECT ...".  And no this proposal would not have
> changed anything because changing the python script doing the import to add
> some minimal SQL knowledge was apparently too problematic.  Instead he will
> insert the data in a temporary table and dispatch everything on a per-

Shared buffers advice for windows in the wiki

2021-03-31 Thread talk to ben
Hello,

The wiki page [1] still mentions that: "On Windows the useful range (for
shared buffers) is 64MB to 512MB". The topic showed up in a pgtune
discussion [2].

Is it possible to remove this advice, or add a note that since PG 10 it no
longer holds true [3]?

Benoit

[1] https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
[2] https://github.com/le0pard/pgtune/discussions/50
[3]
https://github.com/postgres/postgres/commit/81c52728f82be5303ea16508255e948017f4cd87#diff-0553fde8fa0de14527cb5c14b02adb28bdedf3e458f7347b51fb11e1dade2fa7


Re: Shared buffers advice for windows in the wiki

2021-03-31 Thread David Rowley
On Wed, 31 Mar 2021 at 22:39, talk to ben  wrote:
> Is it possible to remove this advice or add that since pg10 it no longer 
> holds true [3] ?

I've just removed all mention of it from the wiki.

David




Re: Lowering the ever-growing heap->pd_lower

2021-03-31 Thread Matthias van de Meent
On Wed, 31 Mar 2021 at 05:35, Peter Geoghegan  wrote:
>
> On Wed, Mar 10, 2021 at 6:01 AM Matthias van de Meent
>  wrote:
> > > The case I was concerned about back when is that there are various bits of
> > > code that may visit a page with a predetermined TID in mind to look at.
> > > An index lookup is an obvious example, and another one is chasing an
> > > update chain's t_ctid link.  You might argue that if the tuple was dead
> > > enough to be removed, there should be no such in-flight references to
> > > worry about, but I think such an assumption is unsafe.  There is not, for
> > > example, any interlock that ensures that a process that has obtained a TID
> > > from an index will have completed its heap visit before a VACUUM that
> > > subsequently removed that index entry also removes the heap tuple.
> >
> > I am aware of this problem. I will admit that I did not detect all
> > potentially problematic accesses, so I'll show you my work.
>
> Attached is a trivial rebase of your v3, which I've called v4. I am
> interested in getting this patch into Postgres 14.

Thanks, and that'd be great! PFA v5, which fixes one issue mentioned
later, and adds some comments on existing checks that are now in a
critical path.

> > In my search for problematic accesses, I make the following assumptions:
> > * PageRepairFragmentation as found in bufpage is only applicable to
> > heap pages; this function is not applied to other pages in core
> > postgres. So, any problems that occur are due to access with an
> > OffsetNumber > PageGetMaxOffsetNumber.
> > * Items [on heap pages] are only extracted after using PageGetItemId
> > for that item on the page, whilst holding a lock.
>
> > The 3 problem cases were classified based on the origin of the
> > potentially invalid pointer.
> >
> > Problem case 1: table scan; heapam.c lines 678 and 811, in heapgettup
>
> I think that it boils down to this: 100% of the cases where this could
> be a problem either involve old TIDs or old line pointers -- in
> principle these could be invalidated in some way, like your backwards
> scan example. But that's it. Bear in mind that we always call
> PageRepairFragmentation() with a super-exclusive lock.

Yeah, that's the gist of what I found out. All accesses using old line
pointers need revalidation, and there were some cases in which this
was not yet done correctly.

> > This is in the replay of transaction logs, in heap_xlog_freeze_page.
> > As I am unaware whether or not pages to which these transaction logs
> > are applied can contain changes from the xlog-generating instance, I
> > flagged this as a potential problem. The application of the xlogs is
> > probably safe (it assumes the existence of a HeapTupleHeader for that
> > ItemId), but my limited knowledge put this on the 'potential problem'
> > list.
> >
> > Please advise on this; I cannot find the right documentation
>
> Are you aware of wal_consistency_checking?

I was vaguely aware that an option with that name exists, but that was
about the extent of it. Thanks for pointing me in that direction.

> I think that this should be fine. There are differences that are
> possible between a replica and primary, but they're very minor and
> don't seem relevant.

OK, then I'll assume that WAL replay shouldn't cause problems here.

> > Problem case 3: invalid redirect pointers; pruneheap.c line 949, in
> > heap_get_root_tuples
> >
> > The heap pruning mechanism currently assumes that all redirects are
> > valid. Currently, if a HOT root points to !ItemIdIsNormal, it breaks
> > out of the loop, but doesn't actually fail. This code is already
> > potentially problematic because it has no bounds or sanity checking
> > for the nextoffnum, but with shrinking pd_linp it would also add the
> > failure mode of HOT tuples pointing into what is now arbitrary data.
>
> heap_prune_chain() is less trusting than heap_get_root_tuples(),
> though -- it doesn't trust redirects (because there is a generic
> offnum sanity check at the start of its loop). I think that the
> inconsistency between these two functions probably isn't hugely
> significant.
>
> Ideally it would be 100% clear which of the defenses in code like this
> is merely extra hardening. The assumptions should be formalized. There
> is nothing wrong with hardening, but we should know it when we see it.

I realized one of my Assert()s was incorrectly asserting an actually
valid page state, so I've updated and documented that case.

> > This failure mode is now accompanied by an Assert, which fails when
> > the redirect is to an invalid OffsetNumber. This is not enough to not
> > exhibit arbitrary behaviour when accessing corrupted data with
> > non-assert builds, but I think that that is fine; we already do not
> > have a guaranteed behaviour for this failure mode.
>
> What about my "set would-be LP_UNUSED space to all-0x7F bytes in
> CLOBBER_FREED_MEMORY builds" idea? Did you think about that?

I had implemented it locally, but was waiting 

Re: Asynchronous Append on postgres_fdw nodes.

2021-03-31 Thread Etsuro Fujita
On Tue, Mar 30, 2021 at 8:40 PM Etsuro Fujita  wrote:
> I'm happy with the patch, so I'll commit it if there are no objections.

Pushed.

Best regards,
Etsuro Fujita




Re: Issue with point_ops and NaN

2021-03-31 Thread Laurenz Albe
On Wed, 2021-03-31 at 15:48 +0900, Kyotaro Horiguchi wrote:
> > > > > > SELECT point('NaN','NaN') <@ polygon('(0,0),(1,0),(1,1),(0,0)');
> > > > > > ?column? 
> > > > > > --
> > > > > >   t
> > > > > >   (1 row)
> > > > 
> > > > Agreed --- one could make an argument for either 'false' or NULL
> > > > result, but surely not 'true'.
> 
> Thanks! However, Michael's suggestion is worth considering.  What do
> you think about making NaN-involved comparisons return NULL?  If you
> agree to that, I'll make a further change to the patch.

If you think of "NaN" literally as "not a number", then FALSE would
make sense, since "not a number" implies "not a number between 0 and 1".

But since NaN is the result of operations like 0/0 or infinity - infinity,
NULL might be better.

So I'd opt for NULL too.

Yours,
Laurenz Albe





Re: SQL-standard function body

2021-03-31 Thread Julien Rouhaud
On Tue, Mar 23, 2021 at 11:28:55PM -0500, Jaime Casanova wrote:
> On Fri, Mar 19, 2021 at 8:49 AM Peter Eisentraut
>  wrote:
> >
> > Right.  Here is a new patch with that fix added and a small conflict
> > resolved.
> 
> Great.
> 
> It seems print_function_sqlbody() (src/backend/utils/adt/ruleutils.c:3292)
> is not protected against receiving a function that doesn't have a
> SQL-standard body; instead it has an assert
> that gets hit with something like this:
> 
> CREATE FUNCTION foo() RETURNS int LANGUAGE SQL AS $$ SELECT 1 $$;
> SELECT pg_get_function_sqlbody('foo'::regproc);

It would also be good to add a regression test checking that we can't define a
function with both a prosrc and a prosqlbody.



@@ -76,6 +77,7 @@ ProcedureCreate(const char *procedureName,
Oid languageValidator,
const char *prosrc,
const char *probin,
+   Node *prosqlbody,
char prokind,
bool security_definer,
bool isLeakProof,
@@ -119,8 +121,6 @@ ProcedureCreate(const char *procedureName,
/*
 * sanity checks
 */
-   Assert(PointerIsValid(prosrc));
-
parameterCount = parameterTypes->dim1;


Shouldn't we still assert that we have either a valid prosrc or a valid
prosqlbody?

No other comments apart from that!




Crash in record_type_typmod_compare

2021-03-31 Thread Sait Talha Nisanci
Hello,

In citus, we have seen the following crash backtrace multiple times because of
a NULL tupledesc, and we weren't sure if this was related to citus or
postgres:


#0  equalTupleDescs (tupdesc1=0x0, tupdesc2=0x1b9f3f0) at tupdesc.c:417
417 tupdesc.c: No such file or directory.
#0  equalTupleDescs (tupdesc1=0x0, tupdesc2=0x1b9f3f0) at tupdesc.c:417
#1  0x0085b51f in record_type_typmod_compare (a=, 
b=, size=) at typcache.c:1761
#2  0x00869c73 in hash_search_with_hash_value (hashp=0x1c10530, 
keyPtr=keyPtr@entry=0x7ffcfd3117b8, hashvalue=3194332168, 
action=action@entry=HASH_ENTER, foundPtr=foundPtr@entry=0x7ffcfd3117c0) at 
dynahash.c:987
#3  0x0086a3fd in hash_search (hashp=, 
keyPtr=keyPtr@entry=0x7ffcfd3117b8, action=action@entry=HASH_ENTER, 
foundPtr=foundPtr@entry=0x7ffcfd3117c0) at dynahash.c:911
#4  0x0085d0e1 in assign_record_type_typmod (tupDesc=, 
tupDesc@entry=0x1b9f3f0) at typcache.c:1801
#5  0x0061832b in BlessTupleDesc (tupdesc=0x1b9f3f0) at 
execTuples.c:2056
#6  TupleDescGetAttInMetadata (tupdesc=0x1b9f3f0) at execTuples.c:2081
#7  0x7f2701878dee in CreateDistributedExecution 
(modLevel=ROW_MODIFY_READONLY, taskList=0x1c82398, hasReturning=, paramListInfo=0x1c3e5a0, tupleDescriptor=0x1b9f3f0, tupleStore=, targetPoolSize=16, xactProperties=0x7ffcfd311960, jobIdList=0x0) at 
executor/adaptive_executor.c:951
#8  0x7f270187ba09 in AdaptiveExecutor (scanState=0x1b9eff0) at 
executor/adaptive_executor.c:676
#9  0x7f270187c582 in CitusExecScan (node=0x1b9eff0) at 
executor/citus_custom_scan.c:182
#10 0x0060c9e2 in ExecProcNode (node=0x1b9eff0) at 
../../../src/include/executor/executor.h:239
#11 ExecutePlan (execute_once=, dest=0x1abfc90, 
direction=, numberTuples=0, sendTuples=, 
operation=CMD_SELECT, use_parallel_mode=, planstate=0x1b9eff0, 
estate=0x1b9ed50) at execMain.c:1646
#12 standard_ExecutorRun (queryDesc=0x1c3e660, direction=, 
count=0, execute_once=) at execMain.c:364
#13 0x7f27018819bd in CitusExecutorRun (queryDesc=0x1c3e660, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
executor/multi_executor.c:177
#14 0x7f27000adfee in pgss_ExecutorRun (queryDesc=0x1c3e660, 
direction=ForwardScanDirection, count=0, execute_once=) at 
pg_stat_statements.c:891
#15 0x0074f97d in PortalRunSelect (portal=portal@entry=0x1b8ed00, 
forward=forward@entry=true, count=0, count@entry=9223372036854775807, 
dest=dest@entry=0x1abfc90) at pquery.c:929
#16 0x00750df0 in PortalRun (portal=portal@entry=0x1b8ed00, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, 
run_once=, dest=dest@entry=0x1abfc90, 
altdest=altdest@entry=0x1abfc90, completionTag=0x7ffcfd312090 "") at 
pquery.c:770
#17 0x0074e745 in exec_execute_message (max_rows=9223372036854775807, 
portal_name=0x1abf880 "") at postgres.c:2090
#18 PostgresMain (argc=, argv=argv@entry=0x1b4a0e8, 
dbname=, username=) at postgres.c:4308
#19 0x006de9d8 in BackendRun (port=0x1b37230, port=0x1b37230) at 
postmaster.c:4437
#20 BackendStartup (port=0x1b37230) at postmaster.c:4128
#21 ServerLoop () at postmaster.c:1704
#22 0x006df955 in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0x1aba280) at postmaster.c:1377
#23 0x00487a4e in main (argc=3, argv=0x1aba280) at main.c:228

This is the issue: https://github.com/citusdata/citus/issues/3825


I think this is related to postgres because of the following events:

  *   In assign_record_type_typmod, the tupledesc will be set to NULL if it is
not in the cache, and it will only be set to an actual value a few lines later.
  *   It is possible that postgres will error in between these two lines, hence
leaving a NULL tupledesc in the cache, for example in
find_or_make_matching_shared_tupledesc (possibly because of OOM).
  *   Now there is a NULL tupledesc in the hash table, hence when doing a
comparison in record_type_typmod_compare, it will crash.

I have manually added a line to error in 
"find_or_make_matching_shared_tupledesc" and I was able to get a similar crash 
with two subsequent simple SELECT queries. You can see the backtrace in the 
issue.

We should probably do HASH_ENTER only after we have a valid entry so that we
don't end up with a NULL entry in the cache even if

Re: Failed assertion on standby while shutdown

2021-03-31 Thread Maxim Orlov

On 2021-03-30 20:44, Maxim Orlov wrote:
The following review has been posted through the commitfest 
application:

make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

All the tests passed successfully.

The new status of this patch is: Ready for Committer


The patch is good. One note: should we put a comment about
ShutdownRecoveryTransactionEnvironment's non-reentrant behaviour? Or maybe
rename it to ShutdownRecoveryTransactionEnvironmentOnce?


---
Best regards,
Maxim Orlov.




Re: "box" type description

2021-03-31 Thread Christoph Berg
Re: Kyotaro Horiguchi
> Returning to the description of pg_type, it should be changed like
> this following the discussion here.
> 
> - pg_catalog | box  | geometric box '(lower left,upper right)'
> + pg_catalog | box  | geometric box 'lower left,upper right'
> 
> But I find it hard to read. I fixed it instead as the following in the
> attached. However, it might rather be better not to change it..
> 
> + pg_catalog | box  | geometric box 'pt-lower-left,pt-upper-right'

I like that because it points to the "point" syntax so users can
figure out how to spell a box.
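
For reference, a couple of equivalent ways to spell a box using that
point-based form (plain SQL, just as illustration):

-- Two corner points, with or without the outer parentheses.
SELECT box '(1,1),(0,0)';
SELECT '((0,0),(1,1))'::box;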

Christoph




Re: New IndexAM API controlling index vacuum strategies

2021-03-31 Thread Masahiko Sawada
On Wed, Mar 31, 2021 at 12:01 PM Peter Geoghegan  wrote:
>
> On Sun, Mar 28, 2021 at 9:16 PM Peter Geoghegan  wrote:
> > And now here's v8, which has the following additional cleanup:
>
> And here's v9, which has improved commit messages for the first 2
> patches, and many small tweaks within all 4 patches.
>
> The most interesting change is that lazy_scan_heap() now has a fairly
> elaborate assertion that verifies that its idea about whether or not
> the page is all_visible and all_frozen is shared by
> heap_page_is_all_visible() -- this is a stripped down version of the
> logic that now lives in lazy_scan_heap(). It exists so that the second
> pass over the heap can set visibility map bits.

Thank you for updating the patches.

Both the 0001 and 0002 patches refactor the whole lazy vacuum code. Can we
merge them? I basically agree with the refactoring made by the 0001 patch,
but I'm a bit concerned that having such a large refactoring this close to
feature freeze could be risky. We would need more eyes to review during
stabilization.

Here are some comments on 0001 patch:

-/*
- * Macro to check if we are in a parallel vacuum.  If true, we are in the
- * parallel mode and the DSM segment is initialized.
- */
-#define ParallelVacuumIsActive(lps) PointerIsValid(lps)
-

I think it's clearer to keep using this macro. The macro can be like this:

#define ParallelVacuumIsActive(vacrel) (((LVRelState *) (vacrel))->lps != NULL)

---
 /*
- * LVDeadTuples stores the dead tuple TIDs collected during the heap scan.
- * This is allocated in the DSM segment in parallel mode and in local memory
- * in non-parallel mode.
+ * LVDeadTuples stores TIDs that are gathered during pruning/the initial heap
+ * scan.  These get deleted from indexes during index vacuuming.  They're then
+ * removed from the heap during a second heap pass that performs heap
+ * vacuuming.
  */

The second sentence of the removed lines still seems to be useful
information for readers?

---
-*
-* Note that vacrelstats->dead_tuples could have tuples which
-* became dead after HOT-pruning but are not marked dead yet.
-* We do not process them because it's a very rare condition,
-* and the next vacuum will process them anyway.

Maybe the above comments should not be removed by 0001 patch.

---
+   /* Free resources managed by lazy_space_alloc() */
+   lazy_space_free(vacrel);

and

+/* Free space for dead tuples */
+static void
+lazy_space_free(LVRelState *vacrel)
+{
+   if (!vacrel->lps)
+   return;
+
+   /*
+* End parallel mode before updating index statistics as we cannot write
+* during parallel mode.
+*/
+   end_parallel_vacuum(vacrel);

Looking at the comments, I thought that this function also frees the
palloc'd dead tuple space, but it doesn't. It seems clearer to either do
pfree(vacrel->dead_tuples) here or not create lazy_space_free() at all.

Also, the comment for end_parallel_vacuum() doesn't look relevant to
this function. Maybe we can update it to:

/* Exit parallel mode and free the parallel context */

---
+   if (shared_istat)
+   {
+   /* Get the space for IndexBulkDeleteResult */
+   bulkdelete_res = &(shared_istat->istat);
+
+   /*
+* Update the pointer to the corresponding bulk-deletion result if
+* someone has already updated it.
+*/
+   if (shared_istat->updated && istat == NULL)
+   istat = bulkdelete_res;
+   }

(snip)

+   if (shared_istat && !shared_istat->updated && istat != NULL)
+   {
+   memcpy(bulkdelete_res, istat, sizeof(IndexBulkDeleteResult));
+   shared_istat->updated = true;
+
+   /*
+* Now that top-level indstats[idx] points to the DSM segment, we
+* don't need the locally allocated results.
+*/
+   pfree(istat);
+   istat = bulkdelete_res;
+   }
+
+   return istat;

If we have parallel_process_one_index() return the address of
IndexBulkDeleteResult, we can simplify the first part above. Also, it
seems better to use a separate variable from istat to store the
result. How about the following structure?

IndexBulkDeleteResult *istat_res;

/*
 * Update the pointer of the corresponding bulk-deletion result if
 * someone has already updated it.
 */
if (shared_istat && shared_istat->updated && istat == NULL)
istat = shared_istat->istat;

/* Do vacuum or cleanup of the index */
if (lvshared->for_cleanup)
istat_res = lazy_cleanup_one_index(indrel, istat, ...);
else
istat_res = lazy_vacuum_one_index(indrel, istat, ...);

/*
 * (snip)
 */
if (shared_istat && !shared_istat->updated && istat_res != NULL)
{
memcpy(

Re: locking [user] catalog tables vs 2pc vs logical rep

2021-03-31 Thread vignesh C
On Wed, Mar 31, 2021 at 2:35 PM Ajin Cherian  wrote:
>
> The patch applies fine on HEAD and "make check" passes fine. No major 
> comments on the patch, just a minor comment:
>
> If you could change the error from, " cannot PREPARE a transaction that has a 
> lock on user catalog/system table(s)"
> to "cannot PREPARE a transaction that has an exclusive lock on user 
> catalog/system table(s)" that would be a more
> accurate instruction to the user.
>

Thanks for reviewing the patch.
Please find the updated patch which includes the fix for the same.

Regards,
Vignesh
From 44c637668a3fc92c5ed9abc23d2c56df9116c6fb Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Mon, 15 Mar 2021 17:21:04 +0530
Subject: [PATCH v2] Fail prepared transaction if transaction has locked system
 tables/user catalog tables.

Don't allow PREPARE TRANSACTION if we've locked a system table or a user
catalog table in this transaction. Allowing prepared transactions that have
locked system tables will result in a hang during logical decoding of those
prepared transactions, since logical decoding of prepared transactions locks
the system table.
---
 doc/src/sgml/ref/prepare_transaction.sgml |   8 ++
 src/backend/access/transam/xact.c |  14 +++
 src/backend/commands/lockcmds.c   |  17 +++
 src/include/access/xact.h |   7 ++
 src/test/regress/expected/prepared_xacts.out  | 102 
 .../regress/expected/prepared_xacts_1.out | 112 ++
 src/test/regress/sql/prepared_xacts.sql   |  75 
 7 files changed, 335 insertions(+)

diff --git a/doc/src/sgml/ref/prepare_transaction.sgml b/doc/src/sgml/ref/prepare_transaction.sgml
index f4f6118ac3..ee8e4072f5 100644
--- a/doc/src/sgml/ref/prepare_transaction.sgml
+++ b/doc/src/sgml/ref/prepare_transaction.sgml
@@ -106,6 +106,14 @@ PREPARE TRANSACTION transaction_id
tied to the current session to be useful in a transaction to be prepared.
   
 
+  
+   It is not currently allowed to PREPARE a transaction that
+   has locked system table(s) or user defined catalog table(s). These features
+   are too tightly tied to logical decoding of
+   PREPARE TRANSACTION and creation of new login session to
+   be useful in a transaction to be prepared.
+  
+
   
If the transaction modified any run-time parameters with SET
(without the LOCAL option),
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c83aa16f2c..00010c2bd2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2422,6 +2422,20 @@ PrepareTransaction(void)
 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  errmsg("cannot PREPARE a transaction that has operated on temporary objects")));
 
+	/*
+	 * Don't allow PREPARE TRANSACTION if we've locked a system table or an user
+	 * catalog table in this transaction. Allowing the prepared transactions
+	 * when it locked system tables will result in deadlock while logical
+	 * decoding of prepared transactions as logical decoding of prepared
+	 * transactions locks the system table. Restarting the server when there is
+	 * a prepared transaction that has a lock on system table results in a
+	 * deadlock as we need to access pg_class during login.
+	 */
+	if (MyXactFlags & XACT_FLAGS_ACQUIREDEXCLUSIVELOCK_SYSREL)
+		ereport(ERROR,
+(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot PREPARE a transaction that has an exclusive lock on user catalog/system table(s)")));
+
 	/*
 	 * Likewise, don't allow PREPARE after pg_export_snapshot.  This could be
 	 * supported if we added cleanup logic to twophase.c, but for now it
diff --git a/src/backend/commands/lockcmds.c b/src/backend/commands/lockcmds.c
index 34f2270ced..b7848d3c6f 100644
--- a/src/backend/commands/lockcmds.c
+++ b/src/backend/commands/lockcmds.c
@@ -16,6 +16,7 @@
 
 #include "access/table.h"
 #include "access/xact.h"
+#include "catalog/catalog.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_inherits.h"
 #include "commands/lockcmds.h"
@@ -60,7 +61,23 @@ LockTableCommand(LockStmt *lockstmt)
 		if (get_rel_relkind(reloid) == RELKIND_VIEW)
 			LockViewRecurse(reloid, lockstmt->mode, lockstmt->nowait, NIL);
 		else if (recurse)
+		{
+			Relation	rel;
 			LockTableRecurse(reloid, lockstmt->mode, lockstmt->nowait);
+			rel = table_open(reloid, NoLock);
+
+			/*
+			 * Make note that we've locked a system table or an user catalog
+			 * table. This flag will be checked later during prepare transaction
+			 * to fail the prepare transaction.
+			 */
+			if (lockstmt->mode >= ExclusiveLock &&
+(IsCatalogRelationOid(reloid) ||
+ RelationIsUsedAsCatalogTable(rel)))
+MyXactFlags |= XACT_FLAGS_ACQUIREDEXCLUSIVELOCK_SYSREL;
+
+			table_close(rel, NoLock);
+		}
 	}
 }
 
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index f49a57b35e..f00a79706c 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -107,6 +107,13 @@ extern int	MyXa

Re: row filtering for logical replication

2021-03-31 Thread Amit Kapila
On Wed, Mar 31, 2021 at 7:17 AM Euler Taveira  wrote:
>
> On Tue, Mar 30, 2021, at 8:23 AM, Amit Kapila wrote:
>
> On Mon, Mar 29, 2021 at 6:47 PM Euler Taveira  wrote:
> >
> Few comments:
> ==
> 1. How can we specify row filters for multiple tables for a
> publication? Consider a case as below:
>
> It is not possible. Row filter is a per table option. Isn't it clear from the
> synopsis?
>

Sorry, it seems I didn't read it properly earlier, now I got it.
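
For readers following along, a sketch of what per-table filters look like
under the syntax proposed in the patch, as I understand it (relation and
column names are made up):

CREATE PUBLICATION pub_filtered FOR TABLE
    orders WHERE (status = 'shipped'),
    customers WHERE (country = 'SE');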

>
> 2.
> + /*
> + * Although ALTER PUBLICATION grammar allows WHERE clause to be specified
> + * for DROP TABLE action, it doesn't make sense to allow it. We implement
> + * this restriction here, instead of complicating the grammar to enforce
> + * it.
> + */
> + if (stmt->tableAction == DEFELEM_DROP)
> + {
> + ListCell   *lc;
> +
> + foreach(lc, stmt->tables)
> + {
> + PublicationTable *t = lfirst(lc);
> +
> + if (t->whereClause)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot use a WHERE clause when removing table from
> publication \"%s\"",
> + NameStr(pubform->pubname;
> + }
> + }
>
> Is there a reason to deal with this here separately rather than in the
> ALTER PUBLICATION grammar?
>
> Good question. IIRC the issue is that AlterPublicationStmt->tables has a list
> element that was a relation_expr_list and was converted to
> publication_table_list. If we share 'tables' with relation_expr_list (for 
> ALTER
> PUBLICATION ... DROP TABLE) and publication_table_list (for the other ALTER
> PUBLICATION ... ADD|SET TABLE), the OpenTableList() has to know what list
> element it is dealing with. I think I came to the conclusion that it is less
> uglier to avoid changing OpenTableList() and CloseTableList().
>
> [Doing some experimentation...]
>
> Here is a patch that remove the referred code.
>

Thanks, a few more comments:
1. In pgoutput_change, we are always sending the schema even though we
don't send the actual data because of row filters. It may not be a problem
in many cases, but I guess for some odd cases we can avoid sending the
extra information.

2. In get_rel_sync_entry(), we are caching the qual for the rel_sync_entry
even though we won't publish it, which seems unnecessary?

3.
@@ -1193,5 +1365,11 @@ rel_sync_cache_publication_cb(Datum arg, int
cacheid, uint32 hashvalue)
  entry->pubactions.pubupdate = false;
  entry->pubactions.pubdelete = false;
  entry->pubactions.pubtruncate = false;
+
+ if (entry->qual != NIL)
+ list_free_deep(entry->qual);

Seeing one previous comment in this thread [1], I am wondering if
list_free_deep is enough here?

4. Can we write explicitly in the docs that row filters won't apply
for Truncate operation?

5. Getting some whitespace errors:
git am 
/d/PostgreSQL/Patches/logical_replication/row_filter/v14-0001-Row-filter-for-logical-replication.patch
.git/rebase-apply/patch:487: trailing whitespace.

warning: 1 line adds whitespace errors.
Applying: Row filter for logical replication


[1] - 
https://www.postgresql.org/message-id/20181123161933.jpepibtyayflz2xg%40alvherre.pgsql

-- 
With Regards,
Amit Kapila.




Re: cursor already in use, UPDATE RETURNING bug?

2021-03-31 Thread Ashutosh Bapat
On Wed, Mar 31, 2021 at 6:09 AM Jaime Casanova
 wrote:
>
> Hi,
>
> Just noted an interesting behaviour when using a cursor in a function
> in an UPDATE RETURNING (note that INSERT RETURNING has no problem).
>
> I have seen this problem in all versions I tested (9.4 thru master).
> Steps to reproduce:
>
> prepare the test
> ```
> create table t1 as select random() * foo i from generate_series(1, 100) foo;
> create table t2 as select random() * foo i from generate_series(1, 100) foo;
>
> CREATE OR REPLACE FUNCTION cursor_bug()
>  RETURNS integer
>  LANGUAGE plpgsql
> AS $function$
> declare
>   c1 cursor (p1 int) for select count(*) from t1 where i = p1;
>   n int4;
> begin
>   open c1 (77);
>   fetch c1 into n;
>   return n;
> end $function$
> ;
> ```
>
> -- this ends fine
> insert into t2 values(5) returning cursor_bug() as c1;
>  c1
> 
>   0
> (1 row)

cursor_bug() is called only once here.

>
> -- this fails
> update t2 set i = 5 returning cursor_bug() as c1;
> ERROR:  cursor "c1" already in use
> CONTEXT:  PL/pgSQL function cursor_bug() line 6 at OPEN

but that's called as many times as the number of rows in t2 in the same
transaction. The first row will go fine. For the second row it will
find c1 is already open. Shouldn't cursor_bug() close c1 at the end?
Is it intended to be kept open when the function finishes? Maybe you
are expecting it to be closed automatically when the function
finishes. But that's not what is documented at
https://www.postgresql.org/docs/13/plpgsql-cursors.html.
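
A minimal sketch of the adjustment that question implies (same function as in
the report, just closing the bound cursor before returning so the next
per-row call in the same query can OPEN it again):

CREATE OR REPLACE FUNCTION cursor_bug()
 RETURNS integer
 LANGUAGE plpgsql
AS $function$
declare
  c1 cursor (p1 int) for select count(*) from t1 where i = p1;
  n int4;
begin
  open c1 (77);
  fetch c1 into n;
  close c1;   -- release the portal so a later call can reopen it
  return n;
end $function$;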

-- 
Best Wishes,
Ashutosh Bapat




Re: making update/delete of inheritance trees scale better

2021-03-31 Thread Amit Langote
On Tue, Mar 30, 2021 at 1:51 PM Tom Lane  wrote:
> Here's a v13 patchset that I feel pretty good about.

Thanks.  After staring at this for a day now, I do too.

> My original thought for replacing the "fake variable" design was to
> add another RTE holding the extra variables, and then have setrefs.c
> translate the placeholder variables to the real thing at the last
> possible moment.  I soon realized that instead of an actual RTE,
> it'd be better to invent a special varno value akin to INDEX_VAR
> (I called it ROWID_VAR, though I'm not wedded to that name).  Info
> about the associated variables is kept in a list of RowIdentityVarInfo
> structs, which are more suitable than a regular RTE would be.
>
> I got that and the translate-in-setrefs approach more or less working,
> but it was fairly messy, because the need to know about these special
> variables spilled into FDWs and a lot of other places; for example
> indxpath.c needed a special check for them when deciding if an
> index-only scan is possible.  What turns out to be a lot cleaner is
> to handle the translation in adjust_appendrel_attrs_mutator(), so that
> we have converted to real variables by the time we reach any
> relation-scan-level logic.
>
> I did end up having to break the API for FDW AddForeignUpdateTargets
> functions: they need to do things differently when adding junk columns,
> and they need different parameters.  This seems all to the good though,
> because the old API has been a backwards-compatibility hack for some
> time (e.g., in not passing the "root" pointer).

This all looks really neat.

I couldn't help but think that the RowIdentityVarInfo management code
looks a bit like the SpecialJunkVarInfo stuff in my earliest patches, but
of course without all the fragility of assigning "fake" attribute
numbers to "real" base relations.

> Some other random notes:
>
> * I was unimpressed with the idea of distinguishing different target
> relations by embedding integer constants in the plan.  In the first
> place, the implementation was extremely fragile --- there was
> absolutely NOTHING tying the counter you used to the subplans' eventual
> indexes in the ModifyTable lists.  Plus I don't have a lot of faith
> that setrefs.c will reliably do what you want in terms of bubbling the
> things up.  Maybe that could be made more robust, but the other problem
> is that the EXPLAIN output is just about unreadable; nobody will
> understand what "(0)" means.  So I went back to the idea of emitting
> tableoid, and installed a hashtable plus a one-entry lookup cache
> to make the run-time mapping as fast as I could.  I'm not necessarily
> saying that this is how it has to be indefinitely, but I think we
> need more work on planner and EXPLAIN infrastructure before we can
> get the idea of directly providing a list index to work nicely.

Okay.

> * I didn't agree with your decision to remove the now-failing test
> cases from postgres_fdw.sql.  I think it's better to leave them there,
> especially in the cases where we were checking the plan as well as
> the execution.  Hopefully we'll be able to un-break those soon.

Okay.

> * I updated a lot of hereby-obsoleted comments, which makes the patch
> a bit longer than v12; but actually the code is a good bit smaller.
> There's a noticeable net code savings in src/backend/optimizer/,
> which there was not before.

Agreed.  (I had evidently missed a bunch of comments referring to the
old ways of how inherited updates are performed.)

> I've not made any attempt to do performance testing on this,
> but I think that's about the only thing standing between us
> and committing this thing.

I reran some of the performance tests I did earlier (I've attached the
modified test running script for reference):

pgbench -n -T60 -M{simple|prepared} -f nojoin.sql

nojoin.sql:

\set a random(1, 100)
update test_table t set b = :a where a = :a;

...and here are the tps figures:

-Msimple

nparts  10cols  20cols  40cols

master:
64      10112   9878    10920
128     9662    10691   10604
256     9642    9691    10626
1024    8589    9675    9521

patched:
64  13493   13463   13313
128 13305   13447   12705
256 13190   13161   12954
1024    11791   11408   11786

No variation across the various column counts, but the patched version
improves the tps for each case by quite a bit.

-Mprepared (plan_cache_mode=force_generic_plan)

master:
64      2286    2285    2266
128     1163    1127    1091
256     531     519     544
1024    77      71      69

patched:
64      6522    6612    6275
128     3568    3625    3372
256     1847    1710    1823
1024    433     427     386

Again, no variation across column counts.  tps drops as partition
count increases both before and after applying the patches, although
the patched version performs way better, which is mainly attributable
to the ability of UP

Re: pgbench - add pseudo-random permutation function

2021-03-31 Thread Dean Rasheed
On Wed, 31 Mar 2021 at 09:02, Fabien COELHO  wrote:
>
> >> First, I have a thing against erand48.
> >
> Also, there is a 64 bits seed provided to the function which instantly
> ignores 16 of them, which looks pretty silly to me.
>

Yeah, that was copied from set_random_seed().

> At least, I suggest that two 48-bits prng could be initialized with parts
> of the seed and used in different places, eg for r & m.
>

That could work. I'd certainly feel better about that than
implementing a whole new PRNG.

> Also, the seed could be used to adjust the rotation, maybe.
>

Perhaps. I'm not sure it's really necessary though.

> >> I'm really at odds with FULL SHIFT 1, because it means that up to 1/256 of
> >> values are kept out of STEERING. [...]
> >
> > Ah, that's a good point. Something else that also concerned me there was
> > that it might lead to 2 consecutive full shifts with nothing in between,
> > which would lead to less uniform randomness (like the Irwin-Hall
> > distribution). I just did a quick test without the first full shift, and
> > the results do appear to be better,
>
> Indeed, it makes sense to me.
>

OK, attached is an update making this change and simplifying the
rotate code, which hopefully just leaves the question of what (if
anything) to do with pg_erand48().

> >> Third, I think that the rotate code can be simplified, in particular
> >> the ?: should be avoided because it may induce branches quite damaging
> >> to processor performance.
> >
> > Yeah, I wondered about that. Perhaps there's a "trick" that can be
> > used to simplify it. Pre-computing the number of bits in the mask
> > would probably help.
>
> See pg_popcount64().
>

Actually, I used pg_leftmost_one_pos64() to calculate the mask length,
allowing the mask to be computed from that, so there is no longer a
need for compute_mask(), which seems like a neat little
simplification.

Regards,
Dean
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
new file mode 100644
index 50cf22b..84d9566
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -1057,7 +1057,7 @@ pgbench  options<
 
   
 default_seed 
-   seed used in hash functions by default
+   seed used in hash and pseudorandom permutation functions by default
   
 
   
@@ -1866,6 +1866,24 @@ SELECT 4 AS four \; SELECT 5 AS five \as
 
   

+permute ( i, size [, seed ] )
+integer
+   
+   
+Permuted value of i, in the range
+[0, size).  This is the new position of
+i (modulo size) in a
+pseudorandom permutation of the integers 0...size-1,
+parameterized by seed.
+   
+   
+permute(0, 4)
+an integer between 0 and 3
+   
+  
+
+  
+   
 pi ()
 double

@@ -2071,29 +2089,70 @@ f(x) = PHI(2.0 * parameter * (x - mu) /
 

 
+   
+
+  When designing a benchmark which selects rows non-uniformly, be aware
+  that the rows chosen may be correlated with other data such as IDs from
+  a sequence or the physical row ordering, which may skew performance
+  measurements.
+
+
+  To avoid this, you may wish to use the permute
+  function, or some other additional step with similar effect, to shuffle
+  the selected rows and remove such correlations.
+
+   
+
   
 Hash functions hash, hash_murmur2 and
 hash_fnv1a accept an input value and an optional seed parameter.
 In case the seed isn't provided the value of :default_seed
 is used, which is initialized randomly unless set by the command-line
--D option. Hash functions can be used to scatter the
-distribution of random functions such as random_zipfian or
-random_exponential. For instance, the following pgbench
-script simulates possible real world workload typical for social media and
-blogging platforms where few accounts generate excessive load:
+-D option.
+  
+
+  
+permute accepts an input value, a size, and an optional
+seed parameter.  It generates a pseudorandom permutation of integers in
+the range [0, size), and returns the index of the input
+value in the permuted values.  The permutation chosen is parameterized by
+the seed, which defaults to :default_seed, if not
+specified.  Unlike the hash functions, permute ensures
+that there are no collisions or holes in the output values.  Input values
+outside the interval are interpreted modulo the size.  The function raises
+an error if the size is not positive.  permute can be
+used to scatter the distribution of non-uniform random functions such as
+random_zipfian or random_exponential
+so that values drawn more often are not trivially correlated.  For
+instance, the following pgbench script
+simulates a possible real world workload typical for social media and
+blogging platforms where a few accounts generate excessive load:
 
 
-\se

Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-31 Thread Bruce Momjian
On Wed, Mar 31, 2021 at 11:25:32AM +0800, Julien Rouhaud wrote:
> On Thu, Mar 25, 2021 at 10:36:38AM +0800, Julien Rouhaud wrote:
> > On Wed, Mar 24, 2021 at 01:02:00PM -0300, Alvaro Herrera wrote:
> > > On 2021-Mar-24, Julien Rouhaud wrote:
> > > 
> > > > From e08c9d5fc86ba722844d97000798de868890aba3 Mon Sep 17 00:00:00 2001
> > > > From: Bruce Momjian 
> > > > Date: Mon, 22 Mar 2021 17:43:23 -0400
> > > > Subject: [PATCH v20 2/3] Expose queryid in pg_stat_activity and
> > > 
> > > >  src/backend/executor/execMain.c   |   9 ++
> > > >  src/backend/executor/execParallel.c   |  14 ++-
> > > >  src/backend/executor/nodeGather.c |   3 +-
> > > >  src/backend/executor/nodeGatherMerge.c|   4 +-
> > > 
> > > Hmm...
> > > 
> > > I find it odd that there's executor code that acquires the current query
> > > ID from pgstat, after having been put there by planner or ExecutorStart
> > > itself.  Seems like a modularity violation.  I wonder if it would make
> > > more sense to have the value maybe in struct EState (or perhaps there's
> > > a better place -- but I don't think they have a way to reach the
> > > QueryDesc anyhow), put there by ExecutorStart, so that places such as
> > > execParallel, nodeGather etc don't have to fetch it from pgstat but from
> > > EState.
> > 
> > The current queryid is already available in the Estate, as the underlying
> > PlannedStmt contains it.  The problem is that we want to display the top 
> > level
> > queryid, not the current query one, and the top level queryid is held in
> > pgstat.
> 
> So is the current approach ok?  If not I'm afraid that detecting and caching
> the top level queryid in the executor parts would lead to some code
> duplication.

I assume it is since Alvaro didn't reply.  I am planning to apply this
soon.

-- 
  Bruce Momjian  https://momjian.us
  EDB  https://enterprisedb.com

  If only the physical world exists, free will is an illusion.





Re: UniqueKey on Partitioned table.

2021-03-31 Thread Ashutosh Bapat
> b).  How to present the element
> in UniqueKey.  Pure EquivalenceClasses or Mix of Expr and EquivalenceClass as
> we just talked about.
I think the reason we add ECs for sort expressions is to use the
transitive relationship. The EC may start with a single member, but
later in planning that member might find partners which are all
equivalent. A result ordered by one is also ordered by the other. The
same logic applies to UniqueKey as well, doesn't it? If a set of
columns makes a row unique in a result, the set of columns represented
by the other EC members should be unique too. Though a key will start
as a singleton, it might gain EC partners later, and thus the unique
key will transition to all the members. With that logic UniqueKey
should use just ECs instead of bare expressions.
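
As a tiny SQL illustration of that transitivity (the schema is made up,
assuming t.a is unique):

  -- a is unique in t, so a is a unique key of this result; and because
  -- a = b holds in every output row, b is a unique key of it as well,
  -- which is exactly what keeping the whole EC in the UniqueKey captures
  SELECT a, b FROM t WHERE a = b;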

-- 
Best Wishes,
Ashutosh Bapat




Re: Flaky vacuum truncate test in reloptions.sql

2021-03-31 Thread Masahiko Sawada
On Tue, Mar 30, 2021 at 10:22 PM Arseny Sher  wrote:
>
> On 3/30/21 10:12 AM, Michael Paquier wrote:
>
>  > Yep, this is the same problem as the one discussed for c2dc1a7, where
>  > a concurrent checkpoint may cause a page to be skipped, breaking the
>  > test.
>
> Indeed, Alexander Lakhin pointed me to that commit after I wrote the
> message.
>
>  > Why not just using DISABLE_PAGE_SKIPPING instead here?
>
> I think this is not enough. DISABLE_PAGE_SKIPPING disables vm consulting
> (sets
> aggressive=true in the routine); however, if the page is locked and
> lazy_check_needs_freeze says there is nothing to freeze on it, we again
> don't
> look at its contents closely.

Right.

Is it better to add FREEZE to the first "VACUUM reloptions_test;" as well?

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
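
Spelled out, the variant being asked about for the first vacuum would look
something like this (a sketch only; reloptions_test is the table the test
already uses):

  VACUUM (FREEZE, DISABLE_PAGE_SKIPPING) reloptions_test;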




Re: [PATCH] add concurrent_abort callback for output plugin

2021-03-31 Thread Amit Kapila
On Wed, Mar 31, 2021 at 11:55 AM Markus Wanner
 wrote:
>
> On 31.03.21 06:39, Amit Kapila wrote:
> > I have slightly adjusted the comments, docs, and commit message. What
> > do you think about the attached?
>
> Thank you both, Amit and Ajin.  This looks good to me.
>
> Only one minor gripe:
>
> > + a prepared transaction with incomplete changes, in which case the
> > + concurrent_abort field of the passed
> > + ReorderBufferTXN struct is set. This is done so 
> > that
> > + eventually when the ROLLBACK PREPARED is decoded, 
> > there
> > + is a corresponding prepared transaction with a matching gid.
>
> The last sentences there now seems to relate to just the setting of
> "concurrent_abort", rather than the whole reason to invoke the
> prepare_cb.  And the reference to the "gid" is a bit lost.  Maybe:
>
> "Thus even in case of a concurrent abort, enough information is
>  provided to the output plugin for it to properly deal with the
>  ROLLBACK PREPARED once that is decoded."
>

Okay, Changed the patch accordingly.

-- 
With Regards,
Amit Kapila.


v5-0001-Ensure-to-send-a-prepare-after-we-detect-concurre.patch
Description: Binary data


Re: [PATCH] Allow multiple recursive self-references

2021-03-31 Thread Denis Hirn
Based on Tom's feedback, and due to the fact that SQL:2021 forbids
non-linear recursion, version 2 of the patch allows only linear
recursion. Therefore, later SQL committee decisions on non-linear
recursion should not be problematic.

> [LIN] PostgreSQL does not allow multiple references to the recursive
>   common table, even if the recursion is LINEAR.  Plenty of examples
>   of such queries are found in the documentation of established RDBMSs,
>   among these IBM Db2 and MS SQL Server:
>
>   - https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_73/sqlp/rbafyrecursivequeries.htm#d60691e2455
>     (example "Two tables used for recursion using recursive common
>     table expressions")

See the gist at [1] below for SQL code that features multiple (yet linear)
recursive references.  A patched PostgreSQL runs this query as expected.

Best wishes,

   --Denis and Torsten

[1] https://gist.github.com/kryonix/73d77d3eaa5a15b3a4bdb7d590fa1253

0002-Allow-multiple-recursive-self-references.patch
Description: Binary data
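
To make the restriction concrete, here is a small sketch of the kind of
query involved.  Unpatched PostgreSQL rejects it (with an error along the
lines of "recursive reference to query "t" must not appear more than once"),
even though each branch references t only once, i.e. the recursion is still
linear; per the description above, the patch is meant to accept such queries:

  WITH RECURSIVE t(n) AS (
      VALUES (1)
    UNION ALL
      SELECT n + 1 FROM t WHERE n < 3
    UNION ALL
      SELECT n + 10 FROM t WHERE n < 3
  )
  SELECT * FROM t;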



Re: invalid data in file backup_label problem on windows

2021-03-31 Thread David Steele

On 3/29/21 4:34 AM, Magnus Hagander wrote:

On Mon, Mar 29, 2021 at 7:01 AM Michael Paquier  wrote:


On Sun, Mar 28, 2021 at 09:29:10AM -0400, Andrew Dunstan wrote:

- vital to the backup working, and must be written without modification.
+ vital to the backup working and must be written without
modification, which
+ may include opening the file in binary mode.


+= "on Windows"?


I'd say no, better to just tell people to always open the file in
binary mode. It's not hurtful anywhere, there's really no reason ever
to open it in text mode.


Agreed. New patch attached.

Regards,
--
-David
da...@pgmasters.net
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index c5557d5444..8c9186d277 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -913,7 +913,8 @@ SELECT * FROM pg_stop_backup(false, true);
  backup_label in the root directory of the backup. The
  third field should be written to a file named
  tablespace_map unless the field is empty. These 
files are
- vital to the backup working, and must be written without modification.
+ vital to the backup working and must be written byte for byte without 
+ modification, which may require opening the file in binary mode.
 




Re: Flaky vacuum truncate test in reloptions.sql

2021-03-31 Thread Arseny Sher



On 3/31/21 4:17 PM, Masahiko Sawada wrote:

> Is it better to add FREEZE to the first "VACUUM reloptions_test;" as well?

I don't think this matters much, as it tests the contrary, and the
probability of the test passing successfully (in case of a theoretical bug
making vacuum truncate a non-empty relation) becomes stunningly small.
But adding it wouldn't hurt either.

-- cheers, arseny




Re: [PATCH] add concurrent_abort callback for output plugin

2021-03-31 Thread Markus Wanner

On 31.03.21 15:18, Amit Kapila wrote:

On Wed, Mar 31, 2021 at 11:55 AM Markus Wanner

The last sentences there now seems to relate to just the setting of
"concurrent_abort", rather than the whole reason to invoke the
prepare_cb.  And the reference to the "gid" is a bit lost.  Maybe:

 "Thus even in case of a concurrent abort, enough information is
  provided to the output plugin for it to properly deal with the
  ROLLBACK PREPARED once that is decoded."


Okay, Changed the patch accordingly.


That's fine with me.  I didn't necessarily mean to eliminate the hint to 
the concurrent_abort field, but it's more concise that way.  Thank you.


Regards

Markus




Prevent query cancel packets from being replayed by an attacker (From TODO)

2021-03-31 Thread Sebastian Cabot
Hello,

My name is Sebastian and I am new to this list and community.
I have been following PostgreSQL for several years and I love the work done
on it, but I never had the chance (time) to join.

I was going through the TODO list and studied the code and the thread
discussing the optional fixes  and I think I have a solution to this one
which has the following advantages:
1. No change to the protocol is needed
2. Can be implemented in a both forward and backward compatible way
3. Does not require any shared memory trickery
4. Is immune to brute force attacks  (probably)

If this is still something we wish to fix I will be happy to share the
details (and implement it) - I don't wish to burden you with the details if
there is no real interest in solving this.

Cheers
Sebastian


Re: [PATCH] Allow multiple recursive self-references

2021-03-31 Thread Denis Hirn
Sorry, I didn't append the patch properly.

Best wishes,
   --Denis



v2-0001-Allow-multiple-recursive-self-references.patch
Description: Binary data


Re: Crash in record_type_typmod_compare

2021-03-31 Thread Sait Talha Nisanci
>We should probably do HASH_ENTER only after we have a valid entry so that we
>don't end up with a NULL entry in the cache even if an intermediate error
>happens. I will share a fix in this thread soon.

I am attaching this patch.







From: Sait Talha Nisanci
Sent: Wednesday, March 31, 2021 11:50 AM
To: pgsql-hackers 
Cc: Metin Doslu 
Subject: Crash in record_type_typmod_compare

Hello,

In citus, we have seen the following crash backtraces because of a NULL 
tupledesc multiple times and we weren't sure if this was related to citus or 
postgres:


#0  equalTupleDescs (tupdesc1=0x0, tupdesc2=0x1b9f3f0) at tupdesc.c:417
417 tupdesc.c: No such file or directory.
#0  equalTupleDescs (tupdesc1=0x0, tupdesc2=0x1b9f3f0) at tupdesc.c:417
#1  0x0085b51f in record_type_typmod_compare (a=, 
b=, size=) at typcache.c:1761
#2  0x00869c73 in hash_search_with_hash_value (hashp=0x1c10530, 
keyPtr=keyPtr@entry=0x7ffcfd3117b8, hashvalue=3194332168, 
action=action@entry=HASH_ENTER, foundPtr=foundPtr@entry=0x7ffcfd3117c0) at 
dynahash.c:987
#3  0x0086a3fd in hash_search (hashp=, 
keyPtr=keyPtr@entry=0x7ffcfd3117b8, action=action@entry=HASH_ENTER, 
foundPtr=foundPtr@entry=0x7ffcfd3117c0) at dynahash.c:911
#4  0x0085d0e1 in assign_record_type_typmod (tupDesc=, 
tupDesc@entry=0x1b9f3f0) at typcache.c:1801
#5  0x0061832b in BlessTupleDesc (tupdesc=0x1b9f3f0) at 
execTuples.c:2056
#6  TupleDescGetAttInMetadata (tupdesc=0x1b9f3f0) at execTuples.c:2081
#7  0x7f2701878dee in CreateDistributedExecution 
(modLevel=ROW_MODIFY_READONLY, taskList=0x1c82398, hasReturning=, paramListInfo=0x1c3e5a0, tupleDescriptor=0x1b9f3f0, tupleStore=, targetPoolSize=16, xactProperties=0x7ffcfd311960, jobIdList=0x0) at 
executor/adaptive_executor.c:951
#8  0x7f270187ba09 in AdaptiveExecutor (scanState=0x1b9eff0) at 
executor/adaptive_executor.c:676
#9  0x7f270187c582 in CitusExecScan (node=0x1b9eff0) at 
executor/citus_custom_scan.c:182
#10 0x0060c9e2 in ExecProcNode (node=0x1b9eff0) at 
../../../src/include/executor/executor.h:239
#11 ExecutePlan (execute_once=, dest=0x1abfc90, 
direction=, numberTuples=0, sendTuples=, 
operation=CMD_SELECT, use_parallel_mode=, planstate=0x1b9eff0, 
estate=0x1b9ed50) at execMain.c:1646
#12 standard_ExecutorRun (queryDesc=0x1c3e660, direction=, 
count=0, execute_once=) at execMain.c:364
#13 0x7f27018819bd in CitusExecutorRun (queryDesc=0x1c3e660, 
direction=ForwardScanDirection, count=0, execute_once=true) at 
executor/multi_executor.c:177
#14 0x7f27000adfee in pgss_ExecutorRun (queryDesc=0x1c3e660, 
direction=ForwardScanDirection, count=0, execute_once=) at 
pg_stat_statements.c:891
#15 0x0074f97d in PortalRunSelect (portal=portal@entry=0x1b8ed00, 
forward=forward@entry=true, count=0, count@entry=9223372036854775807, 
dest=dest@entry=0x1abfc90) at pquery.c:929
#16 0x00750df0 in PortalRun (portal=portal@entry=0x1b8ed00, 
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, 
run_once=, dest=dest@entry=0x1abfc90, 
altdest=altdest@entry=0x1abfc90, completionTag=0x7ffcfd312090 "") at 
pquery.c:770
#17 0x0074e745 in exec_execute_message (max_rows=9223372036854775807, 
portal_name=0x1abf880 "") at postgres.c:2090
#18 PostgresMain (argc=, argv=argv@entry=0x1b4a0e8, 
dbname=, username=) at postgres.c:4308
#19 0x006de9d8 in BackendRun (port=0x1b37230, port=0x1b37230) at 
postmaster.c:4437
#20 BackendStartup (port=0x1b37230) at postmaster.c:4128
#21 ServerLoop () at postmaster.c:1704
#22 0x006df955 in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0x1aba280) at postmaster.c:1377
#23 0x00487a4e in main (argc=3, argv=0x1aba280) at main.c:228

This is the issue: https://github.com/citusdata/citus/issues/3825


I think this is related to postgres because of the following events:

  *   In assign_record_type_typmod the tupledesc will be set to NULL if it is
not in the cache, and it will only be set to an actual value later in the
function.
  *   It is possible that postgres will error in between these two points,
hence leaving a NULL tupledesc in the cache. For example in
find_or_make_matching_shared_tupledesc. (Possibly because of OOM)
  *   Now there is a NULL tupledesc in the hash table, hence when doing a
comparison in record_type_typmod_compare, it crashes.

Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2021-03-31 Thread Alvaro Herrera
On 2021-Mar-31, Bruce Momjian wrote:

> On Wed, Mar 31, 2021 at 11:25:32AM +0800, Julien Rouhaud wrote:
> > On Thu, Mar 25, 2021 at 10:36:38AM +0800, Julien Rouhaud wrote:
> > > On Wed, Mar 24, 2021 at 01:02:00PM -0300, Alvaro Herrera wrote:

> > > > I find it odd that there's executor code that acquires the current query
> > > > ID from pgstat, after having been put there by planner or ExecutorStart
> > > > itself.  Seems like a modularity violation.  I wonder if it would make
> > > > more sense to have the value maybe in struct EState (or perhaps there's
> > > > a better place -- but I don't think they have a way to reach the
> > > > QueryDesc anyhow), put there by ExecutorStart, so that places such as
> > > > execParallel, nodeGather etc don't have to fetch it from pgstat but from
> > > > EState.
> > > 
> > > The current queryid is already available in the Estate, as the underlying
> > > PlannedStmt contains it.  The problem is that we want to display the top 
> > > level
> > > queryid, not the current query one, and the top level queryid is held in
> > > pgstat.
> > 
> > So is the current approach ok?  If not I'm afraid that detecting and caching
> > the top level queryid in the executor parts would lead to some code
> > duplication.
> 
> I assume it is since Alvaro didn't reply.  I am planning to apply this
> soon.

I'm afraid I don't know enough about how parallel query works to make a
good assessment on this being a good approach or not -- and no time at
present to figure it all out.

-- 
Álvaro Herrera                            39°49'30"S 73°17'W
"I think my standards have lowered enough that now I think 'good design'
is when the page doesn't irritate the living f*ck out of me." (JWZ)




Re: Calendar support in localization

2021-03-31 Thread Surafel Temesgen
On Tue, Mar 30, 2021 at 11:16 AM Daniel Verite 
wrote:

>
> The conversions from julian dates are not necessarily hard, but the
> I/O functions means having localized names for all days, months, eras
> of all calendars in all supported languages. If you're thinking of
> implementing this from scratch (without the ICU dependency), where
> would these names come from? OTOH if we're using ICU, then why
> bother reinventing the julian-to-calendars conversions that ICU
> already does?
>
>
I don't know why, but currently we are using our own functions for
converting (see j2date and date2j); maybe they were written before ICU, but
I think ICU helps in adding other calendar support easily. Regarding the
I/O functions, PostgreSQL hard-codes day and month names in arrays and just
parses and string-compares; if a value is not in the list it errors (see
datetime.c). It would be the same for other calendars, but I think we don't
need all of that if we use ICU.

regards
Surafel
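
For the output side of the existing (Gregorian-only) support, localized day
and month names are already available through lc_time, e.g. (a sketch;
assumes the locale is installed):

  SET lc_time = 'es_ES.UTF-8';
  -- the TM prefix requests translated day/month names
  SELECT to_char(current_date, 'TMDay, DD TMMonth YYYY');

The input side is what remains hard-coded, as noted above.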


Re: Prevent query cancel packets from being replayed by an attacker (From TODO)

2021-03-31 Thread Laurenz Albe
On Wed, 2021-03-31 at 16:54 +0300, Sebastian Cabot wrote:
> My name is Sebastian and I am new to this list and community.
> I have been following PostgreSQL for several years and I love the work done on
>  it, but I never had the chance (time) to join.
> 
> I was going through the TODO list and studied the code and the thread 
> discussing the
>  optional fixes  and I think I have a solution to this one which has the 
> following advantages:
> 1. No change to the protocol is needed
> 2. Can be implemented in a both forward and backward compatible way
> 3. Does not require any shared memory trickery
> 4. Is immune to brute force attacks  (probably)
> 
> If this is still something we wish to fix I will be happy to share the 
> details (and
>  implement it) - I don't wish to burden you with the details if there is no 
> real interest in solving this.

Thank you for your willingness to help!

Sure, this is the place to discuss your idea, go ahead.

Right now is the end of the final commitfest for v14, so people
are busy getting patches committed and you may get less echo than normally.

Yours,
Laurenz Albe





Re: cursor already in use, UPDATE RETURNING bug?

2021-03-31 Thread Jaime Casanova
On Wed, Mar 31, 2021 at 7:50 AM Ashutosh Bapat
 wrote:
>
> On Wed, Mar 31, 2021 at 6:09 AM Jaime Casanova
>
> >
> > -- this fails
> > update t2 set i = 5 returning cursor_bug() as c1;
> > ERROR:  cursor "c1" already in use
> > CONTEXT:  PL/pgSQL function cursor_bug() line 6 at OPEN
>
> but that's called as many time as the number of rows in t2 in the same
> transaction. The first row will go fine. For the second row it will
> find c1 is already open. Shouldn't cursor_bug() close c1 at the end?
> Is it intended to be kept open when the function finishes? May be you
> are expecting it to be closed automatically when the function
> finishes. But that's not what is documented at
> https://www.postgresql.org/docs/13/plpgsql-cursors.html.
>

Now that I see it again, after sleeping, I can see you're right! sorry
for the noise

-- 
Jaime Casanova
Director de Servicios Profesionales
SYSTEMGUARDS - Consultores de PostgreSQL
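
For the record, the pattern being described (closing the bound cursor before
the function returns, so the next per-row call can OPEN it again) looks
roughly like this; cursor_demo and its query are made up, since the original
cursor_bug() body isn't shown here:

  CREATE OR REPLACE FUNCTION cursor_demo() RETURNS int
  LANGUAGE plpgsql AS $$
  DECLARE
      c1 CURSOR FOR SELECT 1;
      v  int;
  BEGIN
      OPEN c1;
      FETCH c1 INTO v;
      CLOSE c1;  -- without this, the portal "c1" stays open until end of
                 -- transaction, and the next call within the same statement
                 -- fails with "cursor already in use"
      RETURN v;
  END;
  $$;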




Re: unconstrained memory growth in long running procedure stored procedure after upgrading 11-12

2021-03-31 Thread Merlin Moncure
On Tue, Mar 30, 2021 at 7:14 PM Tom Lane  wrote:
>
> Justin Pryzby  writes:
> > On Tue, Mar 30, 2021 at 04:17:03PM -0500, Merlin Moncure wrote:
> >> We just upgraded from postgres 11 to 12.6 and our server is running
> >> out of memory and rebooting about 1-2 times a day.
>
> > I haven't tried your test, but this sounds a lot like the issue I reported 
> > with
> > JIT, which is enabled by default in v12.
>
> FWIW, I just finished failing to reproduce any problem with that
> test case ... but I was using a non-JIT-enabled build.

Yep.  Disabling jit (fully, via jit=off, not what was suggested
upthread) eliminated the issue, or at least highly mitigated the leak.
I was using pgdg rpm packaging, which enables jit by default.  Thanks
everyone for looking at this, and the workaround is quick and easy.

merlin




Re: TRUNCATE on foreign table

2021-03-31 Thread Kohei KaiGai
On Tue, Mar 30, 2021 at 2:53, Fujii Masao :
>
> On 2021/03/28 2:37, Kazutaka Onishi wrote:
> > Fujii-san,
> >
> > Thank you for your review!
> > Now I prepare v5 patch and I'll answer to your each comment. please
> > check this again.
>
> Thanks a lot!
>
> > 5. For example, we can easily do that by truncate foreign tables
> > before local ones. Thought?
> >
> > Umm... yeah, I feel it's better procedure, but not so required because
> > TRUNCATE is NOT called frequently.
> > Certainly, we already have postgresIsForeignUpdatable() to check
> > whether the foreign table is updatable or not.
> > Following this way, we have to add postgresIsForeignTruncatable() to check.
> > However, Unlike UPDATE, TRUNCATE is NOT called frequently. Current
> > procedure is inefficient but works correctly.
> > Thus, I feel postgresIsForeignTruncatable() is not needed.
>
> I'm concerned about the case where permission errors at the remote servers
> rather than that truncatable option is disabled. The comments of
> ExecuteTruncate() explains its design as follows. But the patch seems to break
> this because it truncates the local tables before checking the permission on
> foreign tables (i.e., the local tables in remote servers)... No?
>
>  We first open and grab exclusive
>  lock on all relations involved, checking permissions and otherwise
>  verifying that the relation is OK for truncation
>  Finally all the relations are truncated and reindexed.
>
Fujii-san,

What does the "permission checks" mean in this context?
The permission checks on the foreign tables involved are already checked
at truncate_check_rel(), by PostgreSQL's standard access control.

Please assume an extreme example below (a small SQL sketch follows the list):
1. I defined a foreign table with file_fdw onto a local CSV file.
2. Someone tries to scan the foreign table, and ACL allows it.
3. I disallow read permission on the CSV using chmod, after the above step,
but prior to the query execution.
4. Someone's query moved to the execution stage, then IterateForeignScan()
raises an error due to OS level permission checks.
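
A minimal sketch of steps 1 and 2 (server name, table definition and CSV
path are all made up):

  CREATE EXTENSION IF NOT EXISTS file_fdw;
  CREATE SERVER csv_srv FOREIGN DATA WRAPPER file_fdw;
  CREATE FOREIGN TABLE csv_tbl (a int, b text)
      SERVER csv_srv
      OPTIONS (filename '/tmp/data.csv', format 'csv');
  -- step 3 happens outside the database (e.g. chmod 000 /tmp/data.csv),
  -- so a later SELECT fails inside IterateForeignScan() with an OS-level
  -- error even though the database ACL check in step 2 succeeded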

FDW is, literally, a mechanism to convert between external data sources and
PostgreSQL's structured data. Once we have checked the permissions of the
foreign tables by database ACLs, any other permission checks handled by the
FDW driver are a part of execution (like the OS permission check when
file_fdw read(2)s the underlying CSV files).
And we have no reliable way to check all of those permissions up front,
be it an OS file permission check or a database permission check by the
remote server. Even if a checker routine returned a "hint", it may be
changed at execution time. Somebody might change the "truncate" permission
at the remote server.

How about your opinions?

Best regards,
-- 
HeteroDB, Inc / The PG-Strom Project
KaiGai Kohei 




Re: Allow an alias to be attached directly to a JOIN ... USING

2021-03-31 Thread Peter Eisentraut



On 23.03.21 00:18, Tom Lane wrote:

However, ParseNamespaceItem as it stands needs some help for this.
It has a wired-in assumption that p_rte->eref describes the table
and column aliases exposed by the nsitem.  0001 below fixes this by
creating a separate p_names field in an nsitem.  (There are some
comments in 0001 referencing JOIN USING aliases, but no actual code
for the feature.)  That saves one indirection in common code paths,
so it's possibly a win on its own.  Then 0002 is your patch rebased
onto that infrastructure, and with some cleanup of my own.


Makes sense.  I've committed it based on that.


Speaking of decompiled views, I feel like ruleutils.c could do with
a little more work to teach it that these aliases are available.
Right now, it resorts to ugly workarounds:


Yeah, the whole has_dangerous_join_using() can probably be unwound and 
removed with this.  But it's a bit of work.



One other cosmetic thing is that this:

regression=# select tu.* from (t1 join t2 using(a) as tu) tx;
ERROR:  missing FROM-clause entry for table "tu"
LINE 1: select tu.* from (t1 join t2 using(a) as tu) tx;
^

is a relatively dumb error message, compared to

regression=# select t1.* from (t1 join t2 using(a) as tu) tx;
ERROR:  invalid reference to FROM-clause entry for table "t1"
LINE 1: select t1.* from (t1 join t2 using(a) as tu) tx;
^
HINT:  There is an entry for table "t1", but it cannot be referenced from this 
part of the query.

I didn't look into why that isn't working, but maybe errorMissingRTE
needs to trawl all of the ParseNamespaceItems not just the RTEs.


Yes, I've prototyped that and it would have the desired effect.  Might 
need some code rearranging, like either change searchRangeTableForRel() 
to not return an RTE or make a similar function for ParseNamespaceItem 
search.  Needs some more thought.  I have left a test case in that would 
show any changes here.





Re: making update/delete of inheritance trees scale better

2021-03-31 Thread Tom Lane
Amit Langote  writes:
> On Tue, Mar 30, 2021 at 1:51 PM Tom Lane  wrote:
>> Here's a v13 patchset that I feel pretty good about.

> Thanks.  After staring at this for a day now, I do too.

Thanks for looking!  Pushed after some more docs-fiddling and a final
read-through.  I think the only code change from v13 is that I decided
to convert ExecGetJunkAttribute into a "static inline", since it's
just a thin wrapper around slot_getattr().  Doesn't really help
performance much, but it shouldn't hurt.

> ... The drop as partition count increases can
> be attributed to the fact that with a generic plan, there are a bunch
> of steps that must be done across all partitions, such as
> AcquireExecutorLocks(), ExecCheckRTPerms(), per-result-rel
> initializations in ExecInitModifyTable(), etc., even with the patched.
> As mentioned upthread, [1] can help with the last bit.

I'll try to find some time to look at that one.

I'd previously been thinking that we couldn't be lazy about applying
most of those steps at executor startup, but on second thought,
ExecCheckRTPerms should be a no-op anyway for child tables.  So
maybe it would be okay to not take a lock, much less do the other
stuff, until the particular child table is stored into.

regards, tom lane




Re: New IndexAM API controlling index vacuum strategies

2021-03-31 Thread Robert Haas
On Mon, Mar 29, 2021 at 12:16 AM Peter Geoghegan  wrote:
> And now here's v8, which has the following additional cleanup:

I can't effectively review 0001 because it both changes the code for
individual functions significantly and reorders them within the file.
I think it needs to be separated into two patches, one of which makes
the changes and the other of which reorders stuff. I would probably
vote for just dropping the second one, since I'm not sure there's
really enough value there to justify the code churn, but if we're
going to do it, I think it should definitely be done separately.

Here are a few comments on the parts I was able to understand:

* "onerel" is a stupid naming convention that I'd rather not propagate
further. It makes sense in the context of a function whose job it is
to iterate over a list of relations and do something for each one. But
once you're down into the code that only knows about one relation in
the first place, calling that relation "onerel" rather than "rel" or
"vacrel" or even "heaprel" is just confusing. Ditto for "onerelid".

* Moving stuff from static variables into LVRelState seems like a
great idea. Renaming it from LVRelStats seems like a good idea, too.

* Setting vacrel->lps = NULL "for now" when we already did palloc0 at
allocation time seems counterproductive.

* The code associated with the comment block that says "Initialize
state for a parallel vacuum" has been moved inside lazy_space_alloc().
That doesn't seem like an especially good choice, because no casual
reader is going to expect a function called lazy_space_alloc() to be
entering parallel mode and so forth as a side effect. Also, the call
to lazy_space_alloc() still has a comment that says "Allocate the
space for dead tuples in case parallel vacuum is not initialized."
even though the ParallelVacuumIsActive() check has been removed and
the function now does a lot more than allocating space.

* lazy_scan_heap() removes the comment which begins "Note that
vacrelstats->dead_tuples could have tuples which became dead after
HOT-pruning but are not marked dead yet." But IIUC that special case
is removed by a later patch, not 0001, in which case it is that patch
that should be touching this comment.

Regarding 0002:

* It took me a while to understand why lazy_scan_new_page() and
lazy_scan_empty_page() are named the way they are. I'm not sure
exactly what would be better, so I am not necessarily saying I think
you have to change anything, but for the record I think this naming
sucks. The reason we have "lazy" in here, AFAIU, is because originally
we only had old-style VACUUM FULL, and that was the good hard-working
VACUUM, and what we now think of as VACUUM was the "lazy" version that
didn't really do the whole job. Then we decided it was the
hard-working version that actually sucked and we always wanted to be
lazy (or else rewrite the table). So now we have all of these
functions named "lazy" which are really just functions to do "vacuum".
But, if we just did s/lazy/vacuum/g we'd be in trouble, because we use
"vacuum" to mean "part of vacuum." That's actually a pretty insane
thing to do, but we like terminological confusion so much that we
decided to use the word vacuum not just to refer to one part of vacuum
but to two different parts of vacuum. During heap vacuuming, which is
the relevant thing here, we call the first part a "scan" and the
second part "vacuum," hence lazy_scan_page() and lazy_vacuum_page().
For indexes, we can decide to vacuum indexes or cleanup indexes,
either of which is part of our overall strategy of trying to do a
VACUUM. We need some words here that are not so overloaded. If, for
example, we could agree that the whole thing is vacuum and the first
time we touch the heap page that's the strawberry phase and then the
second time we touch it that's the rhubarb phase, then we could have
vacuum_strawberry_page(), vacuum_strawberry_new_page(),
vacuum_rhubarb_phase(), etc. and everything would be a lot clearer,
assuming that you replaced the words "strawberry" and "rhubarb" with
something actually meaningful. But that seems hard. I thought about
suggesting that the word for strawberry should be "prune", but it does
more than that. I thought about suggesting that either the word for
strawberry or the word for rhubarb should be "cleanup," but that's
another word that is already confusingly overloaded. So I don't know.

* But all that having been said, it's easy to get confused and think
that lazy_scan_new_page() is scanning a new page for lazy vacuum, but
in fact it's the new-page handler for the scan phase of lazy vacuum,
and it doesn't scan anything at all. If there's a way to avoid that
kind of confusion, +1 from me.

* One possibility is that maybe it's not such a great idea to put this
logic in its own function. I'm rather suspicious on principle of
functions that are called with a locked or pinned buffer and release
the lock or pin before returning. It suggests that the abstraction is

Re: Prevent query cancel packets from being replayed by an attacker (From TODO)

2021-03-31 Thread Sebastian Cabot
On Wed, Mar 31, 2021 at 5:44 PM Laurenz Albe  wrote:
>
> On Wed, 2021-03-31 at 16:54 +0300, Sebastian Cabot wrote:
> > My name is Sebastian and I am new to this list and community.
> > I have been following PostgreSQL for several years and I love the work done 
> > on
> >  it, but I never had the chance (time) to join.
> >
> > I was going through the TODO list and studied the code and the thread 
> > discussing the
> >  optional fixes  and I think I have a solution to this one which has the 
> > following advantages:
> > 1. No change to the protocol is needed
> > 2. Can be implemented in a both forward and backward compatible way
> > 3. Does not require any shared memory trickery
> > 4. Is immune to brute force attacks  (probably)
> >
> > If this is still something we wish to fix I will be happy to share the 
> > details (and
> >  implement it) - I don't wish to burden you with the details if there is no 
> > real interest in solving this.
>
> Thank you for your willingness to help!
>
> Sure, this is the place to discuss your idea, go ahead.
>
> Right now is the end of the final commitfest for v14, so people
> are busy getting patches committed and you may get less echo than normally.
>
> Yours,
> Laurenz Albe
>
Thank you Laurenz.
I will post the details. Hopefully some developers will find the time
to review the idea.

Summary of the problem:
To cancel a query a query cancel message is sent. This message is
always sent unencrypted even if SSL is enabled because there is a need
for the cancel message to be sent from signals. This opens up the
possibility of a replay attack since the cancel message contains a
cancel auth key which does not change after the backend is started.

Summary of the solution proposed in the discussion on the thread:
https://www.postgresql.org/message-id/489c969d.8020...@enterprisedb.com
Not all the details were agreed upon but the trend was as follows:
1. The current way of canceling a request shall continue to be
supported (at least until the next protocol version update)
2. A new cancel message format and processing method shall be
supported alongside the original and the server shall advertise in the
startup runtime parameters whether this new method is supported
3. A new libpq client that supports the new method shall know to use
it if the server supports it
4. The new method involves the client sending a hashed representation
of the cancel authkey so that the key is never sent in clear text
5. There will be a mechanism (an integer which is incremented before
every cancel message) to generate a new hash for every request
6. When using the new method the postmaster shall check that the
integer of a new request is larger than the one used for the last
request
8. There will be a postmaster configuration parameter to decide
whether the new method is enforced or is optional

The above suggestion is not bad and quite appealing but in reality it
forms a protocol change, It requires changes to both the server and
the client.

Details for new suggestion:

1. We notice that the incentive for the elaborate solution proposed is
that the cancel auth key cannot be regenerated easily (Without
touching shared memory in a way that is not trivial with the current
implementation especially for non EXEC_BACKEND).
2. If a new cancelation authkey can be regenerated once a cancel
message was sent then we could just send the new key and no changes to
libpq or the protocol are required.
2.1. One problem with such an approach is that other clients (JDBC?)
only read the key at startup and this shall be addressed below
3. There will be a postmaster configuration parameter to decide
whether the new method is enforced or is optional

Here is how the new design will allow generating a new "random"
authkey without any shared memory trickery:
We will add two backend variables:
int32 original_cancel_key; (To allow cancels from client that do not
update the key in case the new method is not enforced by the server)
char  random_bits[32];

POSTMASTER:
When creating a new backend the original_cancel_key shall have a copy
of the random key generated. random_bits shall be initialized using
pg_strong_random.

When a new cancel message request arrives:
POSTMASTER:
(If the new method is not enforced then also a match against the
original_cancel_key shall be accepted)
If the message's cancelAuthCode matches the current cancel_key then
1. Generate a new key by a deterministic algorithm from random_bits
2. regenerate random_bits using a SHA256 of the current key and the
current random bits
3. As in today send SIGINT to backend
BACKEND:
Upon receiving SIGINT mark that a new key should be generated.
When appropriate in the loop:
1. Generate a new key by a deterministic algorithm from random_bits
(same way POSTMASTER did so it will get the exact same key)
2. regenerate random_bits using a SHA256 of the current key and the
current random bits (The same way POSTMASTER did so it will get the
same randomness)
3. Send the new key to the client

Re: Idea: Avoid JOINs by using path expressions to follow FKs

2021-03-31 Thread Pavel Stehule
>
>
>
> If using the -> notation, you would only need to manually
> inspect the tables involved in the remaining JOINs;
> since you could be confident all uses of -> cannot affect cardinality.
>
> I think this would be a win also for an expert SQL consultant working
> with a new complex data model never seen before.
>
>
I did not feel comfortable when I read about this proprietary extension of
SQL.  I can accept, and it can be nice, to support ANSI/SQL object
references, but implementing our own syntax for JOIN looks too strange. I
don't see a strong enough benefit in inventing new syntax and increasing the
complexity and possible disorientation of users about correct syntax. Some
users still haven't adopted the difference between old-style joins and modern
joins, and you are inventing a third syntax.

Pavel
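
Purely for illustration (this is not necessarily the exact syntax proposed
upthread, and the schema is made up), the comparison being discussed is
roughly:

  -- hypothetical "->" form: follows the FK, and per the proposal
  -- cannot affect the cardinality of the result
  SELECT od.order_id->customer_id->name, od.quantity
  FROM order_details od;

  -- conventional spelling with explicit joins
  SELECT c.name, od.quantity
  FROM order_details od
  JOIN orders o ON o.order_id = od.order_id
  JOIN customers c ON c.customer_id = o.customer_id;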


Re: making update/delete of inheritance trees scale better

2021-03-31 Thread Robert Haas
On Tue, Mar 30, 2021 at 12:51 AM Tom Lane  wrote:
> Maybe that could be made more robust, but the other problem
> is that the EXPLAIN output is just about unreadable; nobody will
> understand what "(0)" means.

I think this was an idea that originally came from me, prompted by
what we already do for:

rhaas=# explain verbose select 1 except select 2;
 QUERY PLAN
-
 HashSetOp Except  (cost=0.00..0.06 rows=1 width=8)
   Output: (1), (0)
   ->  Append  (cost=0.00..0.05 rows=2 width=8)
 ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..0.02 rows=1 width=8)
   Output: 1, 0
   ->  Result  (cost=0.00..0.01 rows=1 width=4)
 Output: 1
 ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..0.02 rows=1 width=8)
   Output: 2, 1
   ->  Result  (cost=0.00..0.01 rows=1 width=4)
 Output: 2
(11 rows)

That is admittedly pretty magical, but it's a precedent. If you think
the relation OID to subplan index lookup is fast enough that it
doesn't matter, then I guess it's OK, but I guess my opinion is that
the subplan index feels like the thing we really want, and if we're
passing anything else up the plan tree, that seems to be a decision
made out of embarrassment rather than conviction. I think the real
problem here is that the deparsing code isn't in on the secret. If in
the above example, or in this patch, it deparsed as (Subplan Index) at
the parent level, and 0, 1, 2, ... in the children, it wouldn't
confuse anyone, or at least not much more than EXPLAIN output does in
general.

Or even if we just output (Constant-Value) it wouldn't be that bad.
The whole convention of deparsing target lists by recursing into the
children, or one of them, in some ways obscures what's really going
on. I did a talk a few years ago in which I made those target lists
deparse as $OUTER.0, $OUTER.1, $INNER.0, etc. and I think people found
that pretty enlightening, because it's sort of non-obvious in what way
table foo is present when a target list 8 levels up in the join tree
claims to have a value for foo.x. Now, such notation can't really be
recommended in general, because it'd be too hard to understand what
was happening in a lot of cases, but the recursive stuff is clearly
not without its own attendant confusions.

Thanks to both of you for working on this! As I said before, this
seems like really important work.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Crash in record_type_typmod_compare

2021-03-31 Thread Tom Lane
Sait Talha Nisanci  writes:
>> We should probably do HASH_ENTER only after we have a valid entry so that
>> we don't end up with a NULL entry in the cache even if an intermediate
>> error happens. I will share a fix in this thread soon.
> I am attaching this patch.

I see the hazard, but this seems like an expensive way to fix it,
as it forces two hash searches for every insertion.  Couldn't we just
teach record_type_typmod_compare to say "not equal" if it sees a
null tupdesc?

regards, tom lane




Re: pg_amcheck contrib application

2021-03-31 Thread Robert Haas
On Wed, Mar 31, 2021 at 12:34 AM Mark Dilger
 wrote:
> I'm not looking at the old VACUUM FULL code, but my assumption is that if the 
> xvac code were resurrected, then when a tuple is moved off by a VACUUM FULL, 
> the old tuple and associated toast cannot be pruned until concurrent 
> transactions end.  So, if amcheck is running more-or-less concurrently with 
> the VACUUM FULL and has a snapshot xmin no newer than the xid of the VACUUM 
> FULL's xid, it can check the toast associated with the moved off tuple after 
> the VACUUM FULL commits.  If instead the VACUUM FULL xid was older than 
> amcheck's xmin, then the toast is in danger of being vacuumed away.  So the 
> logic in verify_heapam would need to change to think about this distinction.  
> We don't have to concern ourselves about that, because VACUUM FULL cannot be 
> running, and so the xid for it must be older than our xmin, and hence the 
> toast is unconditionally not safe to check.
>
> I'm changing the comments back to how you had them, but I'd like to know why 
> my reasoning is wrong.

Let's start by figuring out *whether* your reasoning is wrong. My
assumption was that old-style VACUUM FULL would move tuples without
retoasting. That is, if we decided to move a tuple from page 2 of the
main table to page 1, we would just write the tuple into page 1,
marking it moved-in, and on page 2 we would mark it moved-off. And
that we would not examine or follow any TOAST pointers at all, so
whatever TOAST entries existed would be shared between the two tuples.
One tuple or the other would eventually die, depending on whether xvac
went on to commit or abort, but either way the TOAST doesn't need
updating because there's always exactly 1 remaining tuple using
pointers to those TOAST values.

Your assumption seems to be the opposite, that the TOASTed values
would be retoasted as part of VF. If that is true, then your idea is
right.

Do you agree with this analysis? If so, we can check the code and find
out which way it actually worked.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Idea: Avoid JOINs by using path expressions to follow FKs

2021-03-31 Thread Julien Rouhaud
On Wed, Mar 31, 2021 at 5:19 PM Joel Jacobson  wrote:
>
> If using the -> notation, you would only need to manually
> inspect the tables involved in the remaining JOINs;
> since you could be confident all uses of -> cannot affect cardinality.

Talking about that, do you have some answers to the points raised in
my previous mail, which are how it's supposed to behave when a table is
both joined using your "->" syntax and a plain JOIN, how to join the
same table multiple times using this new syntax, and how to add
predicates to the join clause using this new syntax?

> I think this would be a win also for an expert SQL consultant working
> with a new complex data model never seen before.

By experience if the queries are written with ANSI JOIN it's not
really a problem.  And if it's a new complex data model that was never
seen, I would need to inspect the data model first anyway to
understand what the query is (or should be) doing.




Re: making update/delete of inheritance trees scale better

2021-03-31 Thread Tom Lane
Robert Haas  writes:
> On Tue, Mar 30, 2021 at 12:51 AM Tom Lane  wrote:
>> Maybe that could be made more robust, but the other problem
>> is that the EXPLAIN output is just about unreadable; nobody will
>> understand what "(0)" means.

> I think this was an idea that originally came from me, prompted by
> what we already do for:

I agree that we have some existing behavior that's related to this, but
it's still messy, and I couldn't find any evidence that suggested that the
runtime lookup costs anything.  Typical subplans are going to deliver
long runs of tuples from the same target relation, so as long as we
maintain a one-element cache of the last lookup result, it's only about
one comparison per tuple most of the time.

> I think the real
> problem here is that the deparsing code isn't in on the secret.

Agreed; if we spent some more effort on that end of it, maybe we
could do something different here.  I'm not very sure what good
output would look like though.  A key advantage of tableoid is
that that's already a thing people know about.

regards, tom lane




Crash in BRIN minmax-multi indexes

2021-03-31 Thread Jaime Casanova
Hi,

Just found $SUBJECT involving time with time zone and a subselect. I
still haven't narrowed it down to the exact minimal table/index schema, but
if you run this query on the regression database it will crash.

```
update public.brintest_multi set
  timetzcol = (select tz from generate_series('2021-01-01'::timestamp
with time zone, '2021-01-31', '5 days') tz limit 1)
;
```

attached a backtrace. Let me know if you need extra information.

--
Jaime Casanova
Director de Servicios Profesionales
SYSTEMGUARDS - Consultores de PostgreSQL
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
set = {__val = {4194304, 140736618859920, 2, 6, 6570493, 
93847671406576, 4611686018427388799, 140140941466278, 0, 
281470681751456, 0, 0, 0, 0, 0, 0}}
pid = 
tid = 
ret = 
#1  0x7f751ad08535 in __GI_abort () at abort.c:79
save_stage = 1
act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, 
sa_mask = {__val = {0, 0, 0, 0, 0, 140140939223029, 
  2, 7004563297194213396, 7003431879952131129, 93847671406576, 
7003772732868633408, 0, 18123550304647508992, 
  140736618860160, 93847715200152, 140736618861024}}, sa_flags = 
-1658978320, sa_restorer = 0x555a9fba3c98}
sigs = {__val = {32, 0 }}
#2  0x555a9d872557 in ExceptionalCondition (conditionName=0x555a9d8ec6c3 
"delta >= 0", 
errorType=0x555a9d8ebf74 "FailedAssertion", fileName=0x555a9d8ebf60 
"brin_minmax_multi.c", lineNumber=2095) at assert.c:69
No locals.
#3  0x555a9d1e9933 in brin_minmax_multi_distance_timetz 
(fcinfo=0x7fffcc2c9250) at brin_minmax_multi.c:2095
delta = -678500
ta = 0x555a9fb99720
tb = 0x555a9fb9c768
#4  0x555a9d87c6f1 in FunctionCall2Coll (flinfo=0x555a9fbd7c88, 
collation=0, arg1=93847715157792, arg2=93847715170152)
at fmgr.c:1163
fcinfodata = {fcinfo = {flinfo = 0x555a9fbd7c88, context = 0x0, 
resultinfo = 0x0, fncollation = 0, isnull = false, 
nargs = 2, args = 0x7fffcc2c9270}, 
  fcinfo_data = "\210|\275\237ZU", '\000' , "\002\000 
\227\271\237ZU\000\000\000\000\000\000\000\000\000\000hǹ\237ZU\000\000\000\000\000@\317#\270A"}
fcinfo = 0x7fffcc2c9250
result = 4735574381219020800
__func__ = "FunctionCall2Coll"
#5  0x555a9d1e85b7 in build_distances (distanceFn=0x555a9fbd7c88, 
colloid=0, eranges=0x555a9fba3b50, neranges=10)
at brin_minmax_multi.c:1350
a1 = 93847715157792
a2 = 93847715170152
r = 4735574381219020800
i = 4
ndistances = 9
distances = 0x555a9fba3c68
#6  0x555a9d1e912f in compactify_ranges (bdesc=0x555a9fbd72a8, 
ranges=0x555a9fc02df8, max_values=32)
at brin_minmax_multi.c:1820
cmpFn = 0x555a9fbd7cc0
distanceFn = 0x555a9fbd7c88
eranges = 0x555a9fba3b50
neranges = 10
distances = 0x555a0010
ctx = 0x555a9fba3a30
oldctx = 0x555a9fb9b9f0
#7  0x555a9d1ea409 in brin_minmax_multi_serialize (bdesc=0x555a9fbd72a8, 
src=93847715589624, dst=0x555a9fb9bea8)
at brin_minmax_multi.c:2333
ranges = 0x555a9fc02df8
s = 0x555a9fb9d688
#8  0x555a9d1f0601 in brin_form_tuple (brdesc=0x555a9fbd72a8, blkno=0, 
tuple=0x555a9fb9bb10, size=0x7fffcc2c94e8)
at brin_tuple.c:165
datumno = 1
values = 0x555a9fb9c108
nulls = 0x555a9fb9c920
anynulls = true
rettuple = 0x89fc97d00
keyno = 15
idxattno = 15
phony_infomask = 0
phony_nullbitmap = 0x555a9fb9c9a0 "\177\177\177~\177\177\177\177"
len = 93847671475524
hoff = 2048
data_len = 93847715170728
i = 32767
untoasted_values = 0x555a9fb9c220
nuntoasted = 0
#9  0x555a9d1e09c1 in brininsert (idxRel=0x555a9fc97dd0, 
values=0x7fffcc2c9640, nulls=0x7fffcc2c9620, heaptid=0x555a9fb92ea8, 
heapRel=0x555a9fc83540, checkUnique=UNIQUE_CHECK_NO, indexUnchanged=false, 
indexInfo=0x555a9fbac900) at brin.c:281
lp = 0x7f751a1abfb8
origsz = 1592
newsz = 93847674317391
page = 0x7f751a1abf80 "\004"
origtup = 0x555a9fb9c9c0
newtup = 0x7fffcc2c9520
samepage = false
need_insert = true
off = 9
brtup = 0x7f751a1ac1d8
dtup = 0x555a9fb9bb10
pagesPerRange = 1
origHeapBlk = 0
heapBlk = 0
bdesc = 0x555a9fbd72a8
revmap = 0x555a9fb93248
buf = 16301
tupcxt = 0x555a9fb9b9f0
oldcxt = 0x555a9fbafa80
autosummarize = false
__func__ = "brininsert"
#10 0x555a9d280a1e in index_insert (indexRelation=0x555a9fc97dd0, 
values=0x7fffcc2c9640, isnull=0x7fffcc2c9620, 
heap_t_ctid=0x555a9fb92ea8, heapRelation=0x555a9fc83540, 
checkUnique=UNIQUE_CHECK_NO, indexUnchanged=false, 
indexInfo=0x555a9fbac900) at indexam.c:193
__func__ = "index_insert"
#11 0x555a9d493a2f in ExecInsertIndexTuples (res

Re: pg_amcheck contrib application

2021-03-31 Thread Mark Dilger



> On Mar 31, 2021, at 10:11 AM, Robert Haas  wrote:
> 
> On Wed, Mar 31, 2021 at 12:34 AM Mark Dilger
>  wrote:
>> I'm not looking at the old VACUUM FULL code, but my assumption is that if 
>> the xvac code were resurrected, then when a tuple is moved off by a VACUUM 
>> FULL, the old tuple and associated toast cannot be pruned until concurrent 
>> transactions end.  So, if amcheck is running more-or-less concurrently with 
>> the VACUUM FULL and has a snapshot xmin no newer than the xid of the VACUUM 
>> FULL's xid, it can check the toast associated with the moved off tuple after 
>> the VACUUM FULL commits.  If instead the VACUUM FULL xid was older than 
>> amcheck's xmin, then the toast is in danger of being vacuumed away.  So the 
>> logic in verify_heapam would need to change to think about this distinction. 
>>  We don't have to concern ourselves about that, because VACUUM FULL cannot 
>> be running, and so the xid for it must be older than our xmin, and hence the 
>> toast is unconditionally not safe to check.
>> 
>> I'm changing the comments back to how you had them, but I'd like to know why 
>> my reasoning is wrong.
> 
> Let's start by figuring out *whether* your reasoning is wrong. My
> assumption was that old-style VACUUM FULL would move tuples without
> retoasting. That is, if we decided to move a tuple from page 2 of the
> main table to page 1, we would just write the tuple into page 1,
> marking it moved-in, and on page 2 we would mark it moved-off. And
> that we would not examine or follow any TOAST pointers at all, so
> whatever TOAST entries existed would be shared between the two tuples.
> One tuple or the other would eventually die, depending on whether xvac
> went on to commit or abort, but either way the TOAST doesn't need
> updating because there's always exactly 1 remaining tuple using
> pointers to those TOAST values.
> 
> Your assumption seems to be the opposite, that the TOASTed values
> would be retoasted as part of VF. If that is true, then your idea is
> right.
> 
> Do you agree with this analysis? If so, we can check the code and find
> out which way it actually worked.

Actually, that makes a lot of sense without even looking at the old code.  I 
was implicitly assuming that the toast table was undergoing a VF also, and that 
the toast pointers in the main table tuples would be updated to point to the 
new location, so we'd be unable to follow the pointers to the old location 
without danger of the old location entries being vacuumed away.  But if the 
main table tuples get moved while keeping their toast pointers unaltered, then 
you don't have to worry about that, although you do have to worry that a VF of 
the main table doesn't help so much with toast table bloat.

We're only discussing this in order to craft the right comment for a bit of 
code with respect to a hypothetical situation in which VF gets resurrected, so 
I'm not sure this should be top priority, but I'm curious enough now to go read 
the old code

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







using extended statistics to improve join estimates

2021-03-31 Thread Tomas Vondra
Hi,

So far the extended statistics are applied only at scan level, i.e. when
estimating selectivity for individual tables. Which is great, but joins
are a known challenge, so let's try doing something about it ...

Konstantin Knizhnik posted a patch [1] using functional dependencies to
improve join estimates in January. It's an interesting approach, but as
I explained in that thread I think we should try a different approach,
similar to how we use MCV lists without extended stats. We'll probably
end up considering functional dependencies too, but probably only as a
fallback (similar to what we do for single-relation estimates).

This is a PoC demonstrating the approach I envisioned. It's incomplete
and has various limitations:

- no expression support, just plain attribute references
- only equality conditions
- requires MCV lists on both sides
- inner joins only

All of this can / should be relaxed later, of course. But for a PoC this
seems sufficient.

The basic principle is fairly simple, and mimics what eqjoinsel_inner
does. Assume we have a query:

  SELECT * FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b)

If we have MCV lists on (t1.a,t1.b) and (t2.a,t2.b) then we can use the
same logic as eqjoinsel_inner and "match" them together. If the MCV list
is "larger" - e.g. on (a,b,c) - then it's a bit more complicated, but
the general idea is the same.

To demonstrate this, consider a very simple example with a table that
has a lot of dependency between the columns:

==

CREATE TABLE t (a INT, b INT, c INT, d INT);
INSERT INTO t SELECT mod(i,100), mod(i,100), mod(i,100), mod(i,100)
  FROM generate_series(1,10) s(i);
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

CREATE STATISTICS s (mcv, ndistinct) ON a,b,c,d FROM t;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

ALTER STATISTICS s SET STATISTICS 1;
ANALYZE t;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);

==

The results look like this:

- actual rows: 1
- estimated (no stats):  1003638
- estimated (stats, 100):  100247844
- estimated (stats, 10k):  1

So, in this case the extended stats help quite a bit, even with the
default statistics target.
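
(For reference, the estimated vs. actual numbers above can be compared with
something like the following, which reports the planner's row estimate next
to the actual row count:

  EXPLAIN (ANALYZE, TIMING OFF)
  SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b);
)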

However, there are other things we can do. For example, we can use
restrictions (at relation level) as "conditions" to filter the MCV lists,
and calculate conditional probability. This is useful even if there's
just a single join condition (on one column), but there are dependencies
between that column and the other filters. Or maybe when there are
filters between conditions on the two sides.

Consider for example these two queries:

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d < 25;

SELECT * FROM t t1 JOIN t t2 ON (t1.a = t2.a AND t1.b = t2.b)
 WHERE t1.c < 25 AND t2.d > 75;

In this particular case we know that (a = b = c = d) so the two filters
are somewhat redundant. The regular estimates will ignore that, but with
MCV we can actually detect that - when we combine the two MCV lists, we
essentially calculate MCV (a,b,t1.c,t2.d), and use that.

  Q1  Q2
- actual rows:  2500   0
- estimated (no stats):62158   60241
- estimated (stats, 100):   25047900   1
- estimated (stats, 10k):   2500   1

Obviously, the accuracy depends on how representative the MCV list is
(what fraction of the data it represents), and in this case it works
fairly nicely. A lot of the future work will be about handling cases
when it represents only part of the data.

The attached PoC patch has a number of FIXME and XXX, describing stuff I
ignored to keep it simple, possible future improvement. And so on.


regards


[1]
https://www.postgresql.org/message-id/flat/71d67391-16a9-3e5e-b5e4-8f7fd32cc...@postgrespro.ru

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index d263ecf082..dca1e7d34e 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -157,6 +157,23 @@ clauselist_selectivity_ext(PlannerInfo *root,
 			&estimatedclauses, false);
 	}
 
+	/*
+	 * Try applying extended statistics to joins. There's not much we can
+	 * do to detect when this makes sense, but we can check that there are
+	 * join clauses, and that at least some of the rels have stats.
+	 *
+	 * XXX Isn't this mutualy exclusive with the preceding block which
+	 * calculates estimates for a single relation?
+	 */
+	if (use_extended_stats &&
+		statext_try_join_estimates(root, clauses, varRelid, jointype, sjinfo,
+		 estimatedclauses))
+	{
+		s1 *= statext_clauselist_join_selectivity(root, clauses, varRel

Re: making update/delete of inheritance trees scale better

2021-03-31 Thread Robert Haas
On Wed, Mar 31, 2021 at 1:24 PM Tom Lane  wrote:
> I agree that we have some existing behavior that's related to this, but
> it's still messy, and I couldn't find any evidence that suggested that the
> runtime lookup costs anything.  Typical subplans are going to deliver
> long runs of tuples from the same target relation, so as long as we
> maintain a one-element cache of the last lookup result, it's only about
> one comparison per tuple most of the time.

OK, that's pretty fair.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_amcheck contrib application

2021-03-31 Thread Robert Haas
On Wed, Mar 31, 2021 at 1:31 PM Mark Dilger
 wrote:
> Actually, that makes a lot of sense without even looking at the old code.  I 
> was implicitly assuming that the toast table was undergoing a VF also, and 
> that the toast pointers in the main table tuples would be updated to point to 
> the new location, so we'd be unable to follow the pointers to the old 
> location without danger of the old location entries being vacuumed away.  But 
> if the main table tuples get moved while keeping their toast pointers 
> unaltered, then you don't have to worry about that, although you do have to 
> worry that a VF of the main table doesn't help so much with toast table bloat.
>
> We're only discussing this in order to craft the right comment for a bit of 
> code with respect to a hypothetical situation in which VF gets resurrected, 
> so I'm not sure this should be top priority, but I'm curious enough now to go 
> read the old code

Right, well, we wouldn't be PostgreSQL hackers if we didn't spend lots
of time worrying about obscure details. Whether that's good software
engineering or mere pedantry is sometimes debatable.

I took a look at commit 0a469c87692d15a22eaa69d4b3a43dd8e278dd64,
which removed old-style VACUUM FULL, and AFAICS, it doesn't contain
any references to tuple deforming, varlena, HeapTupleHasExternal, or
anything else that would make me think it has the foggiest idea
whether the tuples it's moving around contain TOAST pointers, so I
think I had the right idea.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pg_amcheck contrib application

2021-03-31 Thread Mark Dilger



> On Mar 31, 2021, at 10:31 AM, Mark Dilger  
> wrote:
> 
> 
> 
>> On Mar 31, 2021, at 10:11 AM, Robert Haas  wrote:
>> 
>> On Wed, Mar 31, 2021 at 12:34 AM Mark Dilger
>>  wrote:
>>> I'm not looking at the old VACUUM FULL code, but my assumption is that if 
>>> the xvac code were resurrected, then when a tuple is moved off by a VACUUM 
>>> FULL, the old tuple and associated toast cannot be pruned until concurrent 
>>> transactions end.  So, if amcheck is running more-or-less concurrently with 
>>> the VACUUM FULL and has a snapshot xmin no newer than the xid of the VACUUM 
>>> FULL's xid, it can check the toast associated with the moved off tuple 
>>> after the VACUUM FULL commits.  If instead the VACUUM FULL xid was older 
>>> than amcheck's xmin, then the toast is in danger of being vacuumed away.  
>>> So the logic in verify_heapam would need to change to think about this 
>>> distinction.  We don't have to concern ourselves about that, because VACUUM 
>>> FULL cannot be running, and so the xid for it must be older than our xmin, 
>>> and hence the toast is unconditionally not safe to check.
>>> 
>>> I'm changing the comments back to how you had them, but I'd like to know 
>>> why my reasoning is wrong.
>> 
>> Let's start by figuring out *whether* your reasoning is wrong. My
>> assumption was that old-style VACUUM FULL would move tuples without
>> retoasting. That is, if we decided to move a tuple from page 2 of the
>> main table to page 1, we would just write the tuple into page 1,
>> marking it moved-in, and on page 2 we would mark it moved-off. And
>> that we would not examine or follow any TOAST pointers at all, so
>> whatever TOAST entries existed would be shared between the two tuples.
>> One tuple or the other would eventually die, depending on whether xvac
>> went on to commit or abort, but either way the TOAST doesn't need
>> updating because there's always exactly 1 remaining tuple using
>> pointers to those TOAST values.
>> 
>> Your assumption seems to be the opposite, that the TOASTed values
>> would be retoasted as part of VF. If that is true, then your idea is
>> right.
>> 
>> Do you agree with this analysis? If so, we can check the code and find
>> out which way it actually worked.
> 
> Actually, that makes a lot of sense without even looking at the old code.  I 
> was implicitly assuming that the toast table was undergoing a VF also, and 
> that the toast pointers in the main table tuples would be updated to point to 
> the new location, so we'd be unable to follow the pointers to the old 
> location without danger of the old location entries being vacuumed away. But 
> if the main table tuples get moved while keeping their toast pointers 
> unaltered, then you don't have to worry about that, although you do have to 
> worry that a VF of the main table doesn't help so much with toast table bloat.
> 
> We're only discussing this in order to craft the right comment for a bit of 
> code with respect to a hypothetical situation in which VF gets resurrected, 
> so I'm not sure this should be top priority, but I'm curious enough now to go 
> read the old code

Well, that's annoying.  The documentation of postgres 8.2 for vacuum full [1] 
says,

  Selects "full" vacuum, which may reclaim more space, but takes much longer 
and exclusively locks the table.

I read "exclusively locks" as meaning it takes an ExclusiveLock, but the code 
shows that it takes an AccessExclusiveLock.  I think the docs are pretty 
misleading here, though I understand that grammatically it is hard to say 
"accessively exclusively locks" or such.  But a part of my analysis was based 
on the reasoning that if VF only takes an ExclusiveLock, then there must be 
concurrent readers possible.  VF went away long enough ago that I had forgotten 
exactly how inconvenient it was.

[1] https://www.postgresql.org/docs/8.2/sql-vacuum.html

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: pg_amcheck contrib application

2021-03-31 Thread Robert Haas
On Wed, Mar 31, 2021 at 1:44 PM Mark Dilger
 wrote:
> I read "exclusively locks" as meaning it takes an ExclusiveLock, but the code 
> shows that it takes an AccessExclusiveLock.  I think the docs are pretty 
> misleading here, though I understand that grammatically it is hard to say 
> "accessively exclusively locks" or such.  But a part of my analysis was based 
> on the reasoning that if VF only takes an ExclusiveLock, then there must be 
> concurrent readers possible.  VF went away long enough ago that I had 
> forgotten exactly how inconvenient it was.

It kinda depends on what you mean by concurrent readers, because a
transaction could start on Monday and acquire an XID, then on
Tuesday you could run VACUUM FULL on relation "foo", and then on
Wednesday the transaction from before could get around to reading some
data from "foo". The two transactions are concurrent, in the sense
that the 3-day transaction was running before the VACUUM FULL, was
still running after VACUUM FULL, read the same pages that the VACUUM
FULL modified, and cares whether the XID from the VACUUM FULL
committed or aborted. But, it's not concurrent in the sense that you
never have a situation where the VACUUM FULL does some of its
modifications, then an overlapping transaction sees them, and then it
does the rest of them.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: pgbench - add pseudo-random permutation function

2021-03-31 Thread Fabien COELHO


Hello Dean,

OK, attached is an update making this change and simplifying the rotate 
code, which hopefully just leaves the question of what (if anything) to 
do with pg_erand48().


Yep. While looking at it, I have some doubts on this part:

 m = (uint64) (pg_erand48(random_state.xseed) * (mask + 1)) | 1;
 r = (uint64) (pg_erand48(random_state.xseed) * (mask + 1));
 r = (uint64) (pg_erand48(random_state.xseed) * size);

I do not understand why the random values are multiplied by anything in 
the first place…


This one looks like a no-op :

   r = (uint64) (pg_erand48(random_state.xseed) * size);
   v = (v + r) % size;

   v = (v + r) % size
 = (v + rand * size) % size
 =? (v % size + rand * size % size) % size
 =? (v % size + 0) % size
 = v % size
 = v

I'm also skeptical about this one:

   r = (uint64) (pg_erand48(random_state.xseed) * (mask + 1));
   if (v <= mask)
  v = ((v * m) ^ r) & mask;

   v = ((v * m) ^ r) & mask
 = ((v * m) ^ r) % (mask+1)
 = ((v * m) ^ (rand * (mask+1))) % (mask+1)
 =? ((v * m) % (mask+1)) ^ (rand * (mask+1) % (mask+1))
 =? ((v * m) % (mask+1)) ^ (0)
 = (v * m) & mask

Or possibly I'm missing something obvious and I'm wrong with my 
arithmetic?


--
Fabien.

Re: "has_column_privilege()" issue with attnums and non-existent columns

2021-03-31 Thread Joe Conway

On 3/30/21 8:17 PM, Joe Conway wrote:

On 3/30/21 6:22 PM, Tom Lane wrote:

Joe Conway  writes:

Heh, I missed the forest for the trees it seems.
That version undid the changes fixing what Ian was originally complaining about.


Duh, right.  It would be a good idea for there to be a code comment
explaining this, because it's *far* from obvious.  Say like

* Check for column-level privileges first.  This serves in
* part as a check on whether the column even exists, so we
* need to do it before checking table-level privilege.


Will do.


My gripe about providing API-spec comments for the new aclchk.c
entry points still stands.  Other than that, I think it's good
to go.


Yeah, I was planning to put something akin to this in all four spots:
8<---
/*
   * Exported routine for checking a user's access privileges to a table
   *
   * Does the bulk of the work for pg_class_aclcheck(), and allows other
   * callers to avoid the missing relation ERROR when is_missing is non-NULL.
   */
AclResult
pg_class_aclcheck_ext(Oid table_oid, Oid roleid,
  AclMode mode, bool *is_missing)
...
8<---
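
For illustration (my sketch, not part of the commit), a caller inside one
of the has_*_privilege() functions could then use the is_missing output to
return NULL instead of erroring when the relation was dropped concurrently:

    /* tableoid/roleid assumed to be resolved earlier in the function */
    bool        is_missing = false;
    AclResult   aclresult;

    aclresult = pg_class_aclcheck_ext(tableoid, roleid, ACL_SELECT,
                                      &is_missing);

    if (is_missing)
        PG_RETURN_NULL();       /* relation is gone: NULL, not ERROR */

    PG_RETURN_BOOL(aclresult == ACLCHECK_OK);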



Pushed that way.

Joe

--
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: ModifyTable overheads in generic plans

2021-03-31 Thread Tom Lane
Amit Langote  writes:
> [ v14-0002-Initialize-result-relation-information-lazily.patch ]

Needs YA rebase over 86dc90056.

regards, tom lane




Re: Crash in BRIN minmax-multi indexes

2021-03-31 Thread Zhihong Yu
Hi,
In build_distances():

a1 = eranges[i].maxval;
a2 = eranges[i + 1].minval;

It seems there was overlap between the successive ranges, leading to
delta = -678500

FYI

On Wed, Mar 31, 2021 at 10:30 AM Jaime Casanova <
jcasa...@systemguards.com.ec> wrote:

> Hi,
>
> Just found $SUBJECT involving time with time zone and a subselect. I
> still don't have narrowed to the exact table/index minimal schema but
> if you run this query on the regression database it will crash.
>
> ```
> update public.brintest_multi set
>   timetzcol = (select tz from generate_series('2021-01-01'::timestamp
> with time zone, '2021-01-31', '5 days') tz limit 1)
> ;
> ```
>
> attached a backtrace. Let me know if you need extra information.
>
> --
> Jaime Casanova
> Director de Servicios Profesionales
> SYSTEMGUARDS - Consultores de PostgreSQL
>


Re: Crash in BRIN minmax-multi indexes

2021-03-31 Thread Tomas Vondra
On 3/31/21 8:20 PM, Zhihong Yu wrote:
> Hi,
> In build_distances():
> 
>         a1 = eranges[i].maxval;
>         a2 = eranges[i + 1].minval;
> 
> It seems there was overlap between the successive ranges, leading to
> delta = -678500
> 

I've been unable to reproduce this, so far :-( How exactly did you
manage to reproduce it?


The thing is - how could there be an overlap? The way we build expanded
ranges that should not be possible, I think. Can you print the ranges at
the end of fill_expanded_ranges? That should shed some light on this.


FWIW I suspect those asserts on delta may be a bit problematic due to
rounding errors. And I found one issue in the inet distance function,
because apparently

test=# select '10.2.14.243/24'::inet < '10.2.14.231/24'::inet;
 ?column?
--
 f
(1 row)

but the delta formula currently ignores the mask. But that's a separate
issue.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: ALTER TABLE .. DETACH PARTITION CONCURRENTLY

2021-03-31 Thread Tom Lane
Alvaro Herrera  writes:
> I added that test as promised, and I couldn't find any problems, so I
> have pushed it.

Buildfarm testing suggests there's an issue under CLOBBER_CACHE_ALWAYS:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=trilobite&dt=2021-03-29%2018%3A14%3A24

specifically

diff -U3 
/home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out
 
/home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out
--- 
/home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/expected/detach-partition-concurrently-4.out
2021-03-29 20:14:21.258199311 +0200
+++ 
/home/buildfarm/trilobite/buildroot/HEAD/pgsql.build/src/test/isolation/output_iso/results/detach-partition-concurrently-4.out
  2021-03-30 18:54:34.272839428 +0200
@@ -324,6 +324,7 @@
 1  
 2  
 step s1insert: insert into d4_fk values (1);
+ERROR:  insert or update on table "d4_fk" violates foreign key constraint 
"d4_fk_a_fkey"
 step s1c: commit;
 
 starting permutation: s2snitch s1b s1s s2detach s1cancel s3vacfreeze s1s 
s1insert s1c

regards, tom lane




Re: row filtering for logical replication

2021-03-31 Thread Andres Freund
Hi,

As far as I can tell you have not *AT ALL* addressed that it is *NOT
SAFE* to evaluate arbitrary expressions from within an output
plugin. Despite that having been brought up multiple times.


> +static ExprState *
> +pgoutput_row_filter_prepare_expr(Node *rfnode, EState *estate)
> +{
> + ExprState  *exprstate;
> + Oid exprtype;
> + Expr   *expr;
> +
> + /* Prepare expression for execution */
> + exprtype = exprType(rfnode);
> + expr = (Expr *) coerce_to_target_type(NULL, rfnode, exprtype, BOOLOID, 
> -1, COERCION_ASSIGNMENT, COERCE_IMPLICIT_CAST, -1);
> +
> + if (expr == NULL)
> + ereport(ERROR,
> + (errcode(ERRCODE_CANNOT_COERCE),
> +  errmsg("row filter returns type %s that cannot 
> be coerced to the expected type %s",
> + format_type_be(exprtype),
> + format_type_be(BOOLOID)),
> +  errhint("You will need to rewrite the row 
> filter.")));
> +
> + exprstate = ExecPrepareExpr(expr, estate);
> +
> + return exprstate;
> +}
> +
> +/*
> + * Evaluates row filter.
> + *
> + * If the row filter evaluates to NULL, it is taken as false i.e. the change
> + * isn't replicated.
> + */
> +static inline bool
> +pgoutput_row_filter_exec_expr(ExprState *state, ExprContext *econtext)
> +{
> + Datum   ret;
> + boolisnull;
> +
> + Assert(state != NULL);
> +
> + ret = ExecEvalExprSwitchContext(state, econtext, &isnull);
> +
> + elog(DEBUG3, "row filter evaluates to %s (isnull: %s)",
> +  DatumGetBool(ret) ? "true" : "false",
> +  isnull ? "true" : "false");
> +
> + if (isnull)
> + return false;
> +
> + return DatumGetBool(ret);
> +}

> +/*
> + * Change is checked against the row filter, if any.
> + *
> + * If it returns true, the change is replicated, otherwise, it is not.
> + */
> +static bool
> +pgoutput_row_filter(Relation relation, HeapTuple oldtuple, HeapTuple 
> newtuple, List *rowfilter)
> +{
> + TupleDesc   tupdesc;
> + EState *estate;
> + ExprContext *ecxt;
> + MemoryContext oldcxt;
> + ListCell   *lc;
> + boolresult = true;
> +
> + /* Bail out if there is no row filter */
> + if (rowfilter == NIL)
> + return true;
> +
> + elog(DEBUG3, "table \"%s.%s\" has row filter",
> +  
> get_namespace_name(get_rel_namespace(RelationGetRelid(relation))),
> +  get_rel_name(relation->rd_id));
> +
> + tupdesc = RelationGetDescr(relation);
> +
> + estate = create_estate_for_relation(relation);
> +
> + /* Prepare context per tuple */
> + ecxt = GetPerTupleExprContext(estate);
> + oldcxt = MemoryContextSwitchTo(estate->es_query_cxt);
> + ecxt->ecxt_scantuple = ExecInitExtraTupleSlot(estate, tupdesc, 
> &TTSOpsHeapTuple);
> + MemoryContextSwitchTo(oldcxt);
> +
> + ExecStoreHeapTuple(newtuple ? newtuple : oldtuple, 
> ecxt->ecxt_scantuple, false);
> + /*
> +  * If the subscription has multiple publications and the same table has 
> a
> +  * different row filter in these publications, all row filters must be
> +  * matched in order to replicate this change.
> +  */
> + foreach(lc, rowfilter)
> + {
> + Node   *rfnode = (Node *) lfirst(lc);
> + ExprState  *exprstate;
> +
> + /* Prepare for expression execution */
> + exprstate = pgoutput_row_filter_prepare_expr(rfnode, estate);
> +
> + /* Evaluates row filter */
> + result = pgoutput_row_filter_exec_expr(exprstate, ecxt);

Also, this still seems like an *extremely* expensive thing to do for
each tuple. It'll often be *vastly* faster to just send the data to the
other side than to filter it.

This just cannot be done once per tuple. It has to be cached.
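
To spell out the kind of caching meant here (a rough sketch with
hypothetical names, not the posted patch): prepare the filter expressions
once, when the relation's cache entry is built, and per tuple only store
the tuple in the slot and evaluate the already-prepared ExprStates.

/* Hypothetical per-relation cache entry for prepared row filters */
typedef struct RowFilterCacheEntry
{
    EState     *estate;         /* long-lived executor state */
    List       *exprstates;     /* one ExprState per publication filter */
} RowFilterCacheEntry;

/* Called once when the relation's entry is (re)built, not per tuple */
static void
row_filter_cache_init(RowFilterCacheEntry *entry,
                      Relation relation, List *rfnodes)
{
    ListCell   *lc;

    entry->estate = create_estate_for_relation(relation);
    entry->exprstates = NIL;

    foreach(lc, rfnodes)
    {
        Node       *rfnode = (Node *) lfirst(lc);

        entry->exprstates =
            lappend(entry->exprstates,
                    pgoutput_row_filter_prepare_expr(rfnode, entry->estate));
    }
}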

I don't see how these issues can be addressed in the next 7 days,
therefore I think this unfortunately needs to be marked as returned with
feedback.

Greetings,

Andres Freund




Re: Idea: Avoid JOINs by using path expressions to follow FKs

2021-03-31 Thread Martin Jonsson
 SAP has implemented something similar all across their stack. In their HANA 
database, application platform ABAP and also their cloud. So clearly they find 
it very popular:-) It is called CDS (Core Data Services) views. Here is a quick 
overview:
- Superset of SQL to declare views and associations between views. They are 
views with sort of named joins. The code is parsed and stored as a normal SQL 
view as well as metadata. Note this metadata is not technically part of the 
database SQL layer but rather the database application layer. The user normally 
sees no difference.
- Superset of SQL to query in a very similar way as described above with paths. 
This is parsed to normal SQL with joins, taking into consideration above 
metadata. Can only work on the above views.
This has obvious limitations, most mentioned earlier in this thread. 
Specifically, join types are limited. Still it eases the pain considerably of 
writing queries. The SAP system I work on now has 400K tables and over 1 
million fields. Most keys are composite.. One needs to be a super hero to keep 
that data model in memory 
This might be an extreme case but I'm sure there are other use cases. SAP 
technical users are actually quite happy to work with it since, in my humble 
opinion, it is in a way SQL light. The nice data model parts without the pesky 
complicated stuff.
It is not really an ORM but it makes ORM work significantly simpler by keeping 
the metadata on the database. The hierarchical CRUD stuff of ORM legend is 
squarely out of scope. 
I've been seeing this type of question appearing regularly in this forum and 
maybe this SAP way holds water as a solution? In that case, the work should 
probably be in user land but close to the database. Maybe as as extension with 
a SQL preprocessor? Much of the grammar and parsing work is already available, 
at least as inspiration.
Martin
 
 






On Monday, March 29, 2021, 12:48:58 UTC+2, Pavel Stehule wrote:

On Mon, Mar 29, 2021 at 12:01, Joel Jacobson wrote:

On Sun, Mar 28, 2021, at 16:04, Tom Lane wrote:

I'm imagining a syntax in which
you give the constraint name instead of the column name.  Thought
experiment: how could the original syntax proposal make any use of
a multi-column foreign key?


Thanks for coming up with this genius idea.

At first I didn't see the beauty of it; I wrongly thought the constraint name 
needed to be
unique per schema, but I realize we could just use the foreign table's name
as the constraint name, which will allow a nice syntax:

SELECT DISTINCT order_details.orders.customers.company_name
FROM order_details
WHERE order_details.products.product_name = 'Chocolade';


This syntax is similar to Oracle's object references (this is an example from a
thread on the Czech Postgres list last week):

Select e.last_name employee,
       e.department_ref.department_name department,
       e.department_ref.manager_ref.last_name dept_manager
From employees_obj e
where e.initials() like 'K_';
I see a few limitations: a) there is no support for outer joins, b) there is no
support for aliasing - and it probably doesn't look too nice when you want to
return more (but not all) columns
Regards
Pavel





Given this data model:

CREATE TABLE customers (
customer_id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
company_name text,
PRIMARY KEY (customer_id)
);

CREATE TABLE orders (
order_id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
customer_id bigint NOT NULL,
PRIMARY KEY (order_id),
CONSTRAINT customers FOREIGN KEY (customer_id) REFERENCES customers
);

CREATE TABLE products (
product_id bigint NOT NULL GENERATED ALWAYS AS IDENTITY,
product_name text NOT NULL,
PRIMARY KEY (product_id)
);

CREATE TABLE order_details (
order_id bigint NOT NULL,
product_id bigint NOT NULL,
PRIMARY KEY (order_id, product_id),
CONSTRAINT orders FOREIGN KEY (order_id) REFERENCES orders,
CONSTRAINT products FOREIGN KEY (product_id) REFERENCES products
);


> Not saying I think this suggestion is a good idea, though. We've seen
> many frameworks that hide joins, and the results are ... less than
> universally good.

Yeah, I'm pretty much not sold on this idea either.  I think it would
lead to the same problems we see with ORMs, namely that people write
queries that are impossible to execute efficiently and then blame
the database for their poor choice of schema.


I think this concern is valid for the original syntax,
but I actually think the idea on using foreign key constraint names
effectively solves an entire class of query writing bugs.

Users writing queries using this syntax are guaranteed to be aware
of the existence of the foreign keys, otherwise they couldn't write
the query this way, since they must use the foreign key
constraint names in the path expression.

This ensures it's not possible to produce a nonsensical JOIN
on the wrong columns, a problem for which traditional JOINs
have no means to protect against.

Even with foreign keys, indexes could of course be missi

Re: Crash in record_type_typmod_compare

2021-03-31 Thread Andres Freund
Hi,

On 2021-03-31 13:10:50 -0400, Tom Lane wrote:
> Sait Talha Nisanci  writes:
> >> We should probably do 
> >> HASH_ENTER
> >>  only after we have a valid entry so that we don't end up with a NULL 
> >> entry in the cache even if an intermediate error happens. I will share a 
> >> fix in this thread soon.
> > I am attaching this patch.
> 
> I see the hazard, but this seems like an expensive way to fix it,
> as it forces two hash searches for every insertion.

Obviously not free - but at least it'd be overhead only in the insertion
path. And the bucket should still be in L1 cache for the second
insertion...

I doubt that the cost of a separate HASH_ENTER is all that significant
compared to find_or_make_matching_shared_tupledesc/CreateTupleDescCopy?

We do the separate HASH_FIND/HASH_ENTER in plenty other places that are
much hotter than assign_record_type_typmod(),
e.g. RelationIdGetRelation().

It could even be that the additional branches in the comparator would
end up costing more in the aggregate...


> Couldn't we just
> teach record_type_typmod_compare to say "not equal" if it sees a
> null tupdesc?

Won't that lead to an accumulation of dead hash table entries over time?

It also just seems quite wrong to have hash table entries that cannot
ever be found via HASH_FIND/HASH_REMOVE, because
record_type_typmod_compare() returns false once there's a NULL in there.

Greetings,

Andres Freund




Re: Redundant errdetail prefix "The error was:" in some logical replication messages

2021-03-31 Thread Tom Lane
Peter Smith  writes:
> PSA version 2 of this patch which adopts your suggestions.

LGTM, pushed.

regards, tom lane




Re: Idea: Avoid JOINs by using path expressions to follow FKs

2021-03-31 Thread Joel Jacobson
On Wed, Mar 31, 2021, at 19:16, Julien Rouhaud wrote:
> On Wed, Mar 31, 2021 at 5:19 PM Joel Jacobson  wrote:
> >
> > If using the -> notation, you would only need to manually
> > inspect the tables involved in the remaining JOINs;
> > since you could be confident all uses of -> cannot affect cardinality.
> 
> Talking about that, do you have some answers to the points raised in
> my previous mail, which is how it's supposed to behave when a table is
> both join using your "->" syntax and a plain JOIN, how to join the
> same table multiple time using this new syntax, and how to add
> predicates to the join clause using  this new syntax.

It's tricky, I don't see a good solution.

My original proposal aimed to improve syntax conciseness.
While this would be nice, I see much more potential value in Tom's idea
of somehow making use of foreign key constrain names.

Instead of trying to hack it into the column-reference part of a query,
maybe it's more fruitful to see if we can find a way to integrate it into the
FROM clause.

Perhaps something along the lines of what Vik suggested earlier:
> FROM a JOIN b WITH a_b_fk

The problem I have with the above is "b" is redundant information,
since the foreign key is always between two specific tables,
and given "a" and "a_b_fk" we know we are joining "b".

I would prefer a new chainable binary operator.

Pavel raised some concerns about using "->" since it is already used by the standard,
but perhaps it is less of a problem when only used in the FROM clause?
Otherwise we could use something else entirely.

Here are some ideas on FROM clause syntax.

With default foreign key constraint names:

SELECT DISTINCT customers.company_name
FROM order_details->order_details_product_id_fkey AS products
JOIN order_details->order_details_order_id_fkey->orders_customer_id_fkey AS 
customers
WHERE products.product_name = 'Chocolade';

In a PostgreSQL-only environment, foreign keys could be renamed:

ALTER TABLE orders RENAME CONSTRAINT orders_customer_id_fkey TO customers;
ALTER TABLE order_details RENAME CONSTRAINT order_details_order_id_fkey TO 
orders;
ALTER TABLE order_details RENAME CONSTRAINT order_details_product_id_fkey TO 
products;

Then we would get:

SELECT DISTINCT customers.company_name
FROM order_details->products
JOIN order_details->orders->customers
WHERE products.product_name = 'Chocolade';

Which would be the same thing as:

SELECT DISTINCT customers.company_name
FROM order_details
JOIN order_details->products
JOIN order_details->orders
JOIN orders->customers
WHERE products.product_name = 'Chocolade';

Type of join can be specified as well as aliases, just like normal:

SELECT DISTINCT c.company_name
FROM order_details AS od
JOIN od->products AS p
FULL JOIN od->orders AS o
LEFT JOIN o->customers AS c
WHERE p.product_name = 'Chocolade';

(FULL and LEFT join make no sense in this example; they are just to illustrate that
join types work like normal.)

I don't know how challenging this would be to integrate into the grammar though.
Here are some other ideas which might be easier to parse:

SELECT DISTINCT customers.company_name
FROM order_details->products
JOIN ON order_details->orders->customers
WHERE products.product_name = 'Chocolade';

SELECT DISTINCT customers.company_name
FROM order_details->products
JOIN USING order_details->orders->customers
WHERE products.product_name = 'Chocolade';

SELECT DISTINCT customers.company_name
FROM order_details->products
JOIN WITH order_details->orders->customers
WHERE products.product_name = 'Chocolade';

More syntax ideas?

Semantic ideas:

* When chaining, all joins on the chain would be of the same type.
* To use different join types, you would write a separate join.
* All tables joined in the chain would be accessible in the rest of the query, via
the names of the foreign key constraints.
* Only the last link on the chain can be given an alias. If you want to alias
something in the middle, split the chain into two separate joins (so that the
one in the middle becomes the last one, which can then be given an alias).
 
Thoughts?

/Joel


Re: Crash in record_type_typmod_compare

2021-03-31 Thread Tom Lane
Andres Freund  writes:
> On 2021-03-31 13:10:50 -0400, Tom Lane wrote:
>> Couldn't we just
>> teach record_type_typmod_compare to say "not equal" if it sees a
>> null tupdesc?

> Won't that lead to an accumulation of dead hash table entries over time?

Yeah, if you have repeat failures there, which doesn't seem very likely.
Still, I take your point that we're doing it the first way in other
places, so maybe inventing a different way here isn't good.

regards, tom lane




Re: multi-install PostgresNode fails with older postgres versions

2021-03-31 Thread Mark Dilger



> On Mar 30, 2021, at 5:41 PM, Mark Dilger  wrote:
> 
> 1) PostgresNode::init() doesn't work for older server versions


PostgresNode::start() doesn't work for servers older than version 10, either.  
If I hack that function to sleep until the postmaster.pid file exists, it 
works, but that is really ugly and is just to prove to myself that it is a 
timing issue.  There were a few commits in the version 10 development cycle 
(cf, commit f13ea95f9e473a43ee4e1baeb94daaf83535d37c) which changed how pg_ctl 
works, though I haven't figured out yet exactly what the interaction with 
PostgresNode would be.  I'll keep looking.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: multi-install PostgresNode fails with older postgres versions

2021-03-31 Thread Alvaro Herrera
On 2021-Mar-31, Mark Dilger wrote:

> PostgresNode::start() doesn't work for servers older than version 10,
> either.  If I hack that function to sleep until the postmaster.pid
> file exists, it works, but that is really ugly and is just to prove to
> myself that it is a timing issue.  There were a few commits in the
> version 10 development cycle (cf, commit
> f13ea95f9e473a43ee4e1baeb94daaf83535d37c) which changed how pg_ctl
> works, though I haven't figured out yet exactly what the interaction
> with PostgresNode would be.  I'll keep looking.

Do you need to do "pg_ctl -w" perhaps?

-- 
Álvaro Herrera   Valdivia, Chile




Re: Crash in record_type_typmod_compare

2021-03-31 Thread Andres Freund
Hi,

On 2021-03-31 13:26:34 +, Sait Talha Nisanci wrote:
> diff --git a/src/backend/utils/cache/typcache.c 
> b/src/backend/utils/cache/typcache.c
> index 4915ef5934..4757e8fa80 100644
> --- a/src/backend/utils/cache/typcache.c
> +++ b/src/backend/utils/cache/typcache.c
> @@ -1970,18 +1970,16 @@ assign_record_type_typmod(TupleDesc tupDesc)
>   CreateCacheMemoryContext();
>   }
>  
> - /* Find or create a hashtable entry for this tuple descriptor */
> + /* Find a hashtable entry for this tuple descriptor */
>   recentry = (RecordCacheEntry *) hash_search(RecordCacheHash,
>   
> (void *) &tupDesc,
> - 
> HASH_ENTER, &found);
> + 
> HASH_FIND, &found);
>   if (found && recentry->tupdesc != NULL)
>   {
>   tupDesc->tdtypmod = recentry->tupdesc->tdtypmod;
>   return;
>   }
>  
> - /* Not present, so need to manufacture an entry */
> - recentry->tupdesc = NULL;
>   oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
>  
>   /* Look in the SharedRecordTypmodRegistry, if attached */
> @@ -1995,6 +1993,10 @@ assign_record_type_typmod(TupleDesc tupDesc)
>   }
>   ensure_record_cache_typmod_slot_exists(entDesc->tdtypmod);
>   RecordCacheArray[entDesc->tdtypmod] = entDesc;
> + /* Not present, so need to manufacture an entry */
> + recentry = (RecordCacheEntry *) hash_search(RecordCacheHash,
> + 
> (void *) &tupDesc,
> + 
> HASH_ENTER, NULL);
>   recentry->tupdesc = entDesc;

ISTM that the ensure_record_cache_typmod_slot_exists() should be moved
to before find_or_make_matching_shared_tupledesc /
CreateTupleDescCopy. Otherwise we can leak if the CreateTupleDescCopy()
succeeds but ensure_record_cache_typmod_slot_exists() fails. Conversely,
if ensure_record_cache_typmod_slot_exists() succeeds, but
CreateTupleDescCopy() we won't leak, since the former is just
repalloc()ing allocations.

Greetings,

Andres Freund




Re: Idea: Avoid JOINs by using path expressions to follow FKs

2021-03-31 Thread Joel Jacobson
On Wed, Mar 31, 2021, at 21:32, Joel Jacobson wrote:
> SELECT DISTINCT customers.company_name
> FROM order_details->products
> JOIN order_details->orders->customers
> WHERE products.product_name = 'Chocolade';

Hm, maybe the operator shouldn't be allowed directly after FROM, but only used 
with a join:

SELECT DISTINCT customers.company_name
FROM order_details
JOIN order_details->orders->customers
JOIN order_details->products
WHERE products.product_name = 'Chocolade';

/Joel

Re: multi-install PostgresNode fails with older postgres versions

2021-03-31 Thread Andrew Dunstan


On 3/31/21 3:48 PM, Alvaro Herrera wrote:
> On 2021-Mar-31, Mark Dilger wrote:
>
>> PostgresNode::start() doesn't work for servers older than version 10,
>> either.  If I hack that function to sleep until the postmaster.pid
>> file exists, it works, but that is really ugly and is just to prove to
>> myself that it is a timing issue.  There were a few commits in the
>> version 10 development cycle (cf, commit
>> f13ea95f9e473a43ee4e1baeb94daaf83535d37c) which changed how pg_ctl
>> works, though I haven't figured out yet exactly what the interaction
>> with PostgresNode would be.  I'll keep looking.
> Do you need to do "pg_ctl -w" perhaps?



Probably. The buildfarm does this unconditionally and has done for a
very long time, so maybe we don't need a version test for it.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: multi-install PostgresNode fails with older postgres versions

2021-03-31 Thread Mark Dilger



> On Mar 31, 2021, at 1:05 PM, Andrew Dunstan  wrote:
> 
> 
> On 3/31/21 3:48 PM, Alvaro Herrera wrote:
>> On 2021-Mar-31, Mark Dilger wrote:
>> 
>>> PostgresNode::start() doesn't work for servers older than version 10,
>>> either.  If I hack that function to sleep until the postmaster.pid
>>> file exists, it works, but that is really ugly and is just to prove to
>>> myself that it is a timing issue.  There were a few commits in the
>>> version 10 development cycle (cf, commit
>>> f13ea95f9e473a43ee4e1baeb94daaf83535d37c) which changed how pg_ctl
>>> works, though I haven't figured out yet exactly what the interaction
>>> with PostgresNode would be.  I'll keep looking.
>> Do you need to do "pg_ctl -w" perhaps?
> 
> 
> 
> Probably. The buildfarm does this unconditionally and has done for a
> very long time, so maybe we don't need a version test for it.

I put a version test for this and it works for me.  I guess you could do it 
unconditionally, if you want, but the condition is just:

-   TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile,
+   TestLib::system_or_bail('pg_ctl',
+   $self->older_than_version('10') ? '-w' : (),
+   '-D', $pgdata, '-l', $logfile,
'restart');


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







RFC: Table access methods and scans

2021-03-31 Thread Mats Kindahl
Hi all,

I started looking into how table scans are handled for table access
methods and have discovered a few things that I find odd. I cannot
find any material regarding why this particular choice was made (if
anybody has pointers, I would be very grateful).

I am quite new to PostgreSQL so forgive me if my understanding of the
code below is wrong and please clarify what I have misunderstood.

I noted that `scan_begin` accepts a `ScanKey` and my *guess* was that
the intention for adding this to the interface was to support primary
indexes for table access methods (the comment is a little vague, but
it seems to point to that). However, looking at where `scan_begin` is
called from, I see that it is called from the following methods in
`tableam.h`:

- `table_beginscan` is always called using zero scan keys and NULL.
- `table_beginscan_strat` is mostly called with zero keys and NULL,
  with the exception of `systable_beginscan`, which is only for system
  tables. It does use this feature.
- `table_beginscan_bm` is only called with zero keys and NULL.
- `table_beginscan_sampling` is only called with zero keys and NULL.
- `table_beginscan_tid` calls `scan_begin` with zero keys and NULL.
- `table_beginscan_analyze` calls `scan_begin` with zero keys and NULL.
- `table_beginscan_catalog` is called with more than one key, but
  AFAICT this is only for catalog tables.
- `table_beginscan_parallel` calls `scan_begin` with zero keys and NULL.

I draw the conclusion that the scan keys only make sense for a table
access method in the odd case where it is used for system tables or
catalog tables, so for all practical purposes the scan key cannot be
used to implement a primary index for general tables.

As an example of how this is useful, I noticed the work by Heikki and
Ashwin [1], where they return a `TableScanDesc` that contains
information about what columns to scan, which looks very useful. Since
the function `table_beginscan` in `src/include/access/tableam.h`
accepts a `ScanKey` as input, this is (AFAICT) what Heikki and Ashwin
were exploiting to create a specialized scan for a columnar store.

Another example of where this can be useful is to optimize access
during a sequential scan when you can handle some specific scans very
efficiently and can "skip ahead" many tuples if you know what is being
looked for instead of filtering "late". Two examples of where this
could be useful are:

- An access method that reads data from a remote system and doesn't want
  to transfer all tuples unless necessary.
- Some sort of log-structured storage with Bloom filters that allows
  you to quickly skip suites that do not have a key.

Interestingly enough, `ScanKey` is generated for `IndexScan` and I
think that the same approach could be used for sequential scans: pick
out the quals that can be used for filtering and offer them to the
table access method through the `scan_begin` callback.
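
To illustrate (just a sketch; MyScanDescData and my_plan_segment_skipping
are hypothetical names, not an existing AM), a table access method's
scan_begin callback could then use the pushed-down keys like this:

/* Hypothetical scan descriptor for an AM that can skip data segments */
typedef struct MyScanDescData
{
    TableScanDescData rs_base;  /* AM-independent part, must be first */
    /* AM-specific skipping state would live here */
} MyScanDescData;

static TableScanDesc
my_scan_begin(Relation rel, Snapshot snapshot,
              int nkeys, struct ScanKeyData *key,
              ParallelTableScanDesc pscan, uint32 flags)
{
    MyScanDescData *scan = palloc0(sizeof(MyScanDescData));

    scan->rs_base.rs_rd = rel;
    scan->rs_base.rs_snapshot = snapshot;
    scan->rs_base.rs_nkeys = nkeys;
    scan->rs_base.rs_key = key;
    scan->rs_base.rs_flags = flags;
    scan->rs_base.rs_parallel = pscan;

    /*
     * With the quals available as scan keys, the AM could e.g. consult
     * per-segment Bloom filters here and remember which segments cannot
     * contain matching rows, so that getnextslot skips them entirely.
     */
    my_plan_segment_skipping(scan, nkeys, key);

    return (TableScanDesc) scan;
}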

Thoughts around this?

Best wishes,
Mats Kindahl

[1] 
https://www.postgresql-archive.org/Zedstore-compressed-in-core-columnar-storage-tp6081536.html




Re: libpq debug log

2021-03-31 Thread Tom Lane
"'alvhe...@alvh.no-ip.org'"  writes:
> So crake failed.  The failure is that it doesn't print the DataRow
> messages.  That's quite odd.  We see this in the trace log:

I think this is a timing problem that's triggered (on some machines)
by force_parallel_mode = regress.  Looking at spurfowl's latest
failure of this type, the postmaster log shows

2021-03-31 14:34:54.982 EDT [18233:15] 001_libpq_pipeline.pl LOG:  execute 
<unnamed>: SELECT 1.0/g FROM generate_series(3, -1, -1) g
2021-03-31 14:34:54.992 EDT [18234:1] ERROR:  division by zero
2021-03-31 14:34:54.992 EDT [18234:2] STATEMENT:  SELECT 1.0/g FROM 
generate_series(3, -1, -1) g
2021-03-31 14:34:54.993 EDT [18233:16] 001_libpq_pipeline.pl ERROR:  division 
by zero
2021-03-31 14:34:54.993 EDT [18233:17] 001_libpq_pipeline.pl STATEMENT:  SELECT 
1.0/g FROM generate_series(3, -1, -1) g
2021-03-31 14:34:54.995 EDT [18216:4] LOG:  background worker "parallel worker" 
(PID 18234) exited with exit code 1
2021-03-31 14:34:54.995 EDT [18233:18] 001_libpq_pipeline.pl LOG:  could not 
send data to client: Broken pipe
2021-03-31 14:34:54.995 EDT [18233:19] 001_libpq_pipeline.pl FATAL:  connection 
to client lost

We can see that the division by zero occurred in a parallel worker.
My theory is that in parallel mode, it's uncertain whether the
error will be reported before or after the "preceding" successful
output rows.  So you need to disable parallelism to make this
test case stable.

regards, tom lane




Re: Idea: Avoid JOINs by using path expressions to follow FKs

2021-03-31 Thread Isaac Morland
On Wed, 31 Mar 2021 at 15:32, Joel Jacobson  wrote:

> On Wed, Mar 31, 2021, at 19:16, Julien Rouhaud wrote:
>
> On Wed, Mar 31, 2021 at 5:19 PM Joel Jacobson  wrote:
> >
> > If using the -> notation, you would only need to manually
> > inspect the tables involved in the remaining JOINs;
> > since you could be confident all uses of -> cannot affect cardinality.
>
> Talking about that, do you have some answers to the points raised in
> my previous mail, which is how it's supposed to behave when a table is
> both join using your "->" syntax and a plain JOIN, how to join the
> same table multiple time using this new syntax, and how to add
> predicates to the join clause using  this new syntax.
>
>
> It's tricky, I don't see a good solution.
>
> My original proposal aimed to improve syntax conciseness.
> While this would be nice, I see much more potential value in Tom's idea
> of somehow making use of foreign key constrain names.
>

Maybe I have a different proposal in mind than anybody else, but I don't
think there is a problem with multiple joins to the same table. If the
joins are via the same constraint, then a single join is enough, and if
they are via different constraints, the constraints have unique names.

I think if TA is a table with a foreign key constraint CB to another table
TB, then the hypothetical expression:

TA -> CB

really just means:

(select TB from TB where (TB.[primary key columns]) = (TA.[source columns of
constraint CB]))

You can then add .fieldname to get the required fieldname. The issue is
that writing it this way is hopelessly verbose, but the short form is fine.
The query planner also needs to be guaranteed to collapse multiple
references through the same constraint to a single actual join (and then
take all the multiple fields requested).

If TA is a table with a foreign key constraint CB to TB, which has a
foreign key constraint CC to TC, then this expression:

TA -> CB -> CC

just means, by the same definition (except I won't expand it fully, only
one level):

(select TC from TC where (TC.[primary key columns]) = ((TA -> CB).[source
columns of constraint CC]))

Which reminds me, I often find myself wanting to write something like
a.(f1, f2, f3) = b.(f1, f2, f3) rather than (a.f1, a.f2, a.f3) = (b.f1,
b.f2, b.f3). But that's another story.


Re: Add docs stub for recovery.conf

2021-03-31 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Stephen Frost (sfr...@snowman.net) wrote:
> > Awesome, attached is just a rebase (not that anything really changed).
> > Unless someone wants to speak up, I'll commit this soonish (hopefully
> > tomorrow, but at least sometime later this week).
> 
> Alright, as this took a bit more playing with to work cleanly in the
> back-branches, I'm putting it back out there with full patches for all
> the back-branches, in case anyone sees anything I missed, but I think I
> got it all right and the docs build for me and at least going through
> all the new pages, everything looks good to me.
> 
> Naturally, only included the appropriate pieces in each of the back
> branches (v10 got the xlog -> WAL changes, v11 had the same, v12 had
> those plus the recovery.conf changes, as did v13 and HEAD).
> 
> Once these all go in, I'll update the default roles patch as discussed
> elsewhere and backpatch that too.  If there's other things we've done
> that would be good to include here, I'd be happy to work with anyone
> who's interested in putting in the effort to add more.  For now, this
> seems like a pretty good set though.
> 
> Unless there's anything further, will commit these soon.

And done.

Thanks all!

Stephen




Re: multi-install PostgresNode fails with older postgres versions

2021-03-31 Thread Andrew Dunstan


On 3/30/21 8:52 PM, Michael Paquier wrote:
> On Tue, Mar 30, 2021 at 08:44:26PM -0400, Andrew Dunstan wrote:
>> Yeah, it should be validated. All things considered I think just calling
>> 'pg_config --version' is probably the simplest validation, and likely to
>> be sufficient.
>>
>> I'll try to come up with something tomorrow.
> There is already TestLib::check_pg_config().  Shouldn't you leverage
> that with PG_VERSION_NUM or equivalent?



TBH, TestLib::check_pg_config looks like a bit of a wart, and I would be
tempted to remove it. It's the only Postgres-specific thing in
TestLib.pm I think. It's only used in one place AFAICT
(src/test/ssl/t/002_scram.pl) so we could just remove it and inline the
code.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: MultiXact\SLRU buffers configuration

2021-03-31 Thread Andrey Borodin


> On 29 Mar 2021, at 11:26, Andrey Borodin wrote:
> 
> My TODO list:
> 1. Try to break patch set v13-[0001-0004]
> 2. Think how to measure performance of linear search versus hash search in 
> SLRU buffer mapping.

Hi Thomas!
I'm still doing my homework. So far all I have found is that the
"utils/dynahash.h" include is not necessary.

I'm thinking about hash tables, and measuring performance near the optimum of
linear search does not seem like a good idea now.
It's impossible to prove that the difference is statistically significant on all
platforms. Even on one platform the measurements are just too noisy.

The shared-buffers lookup table is indeed very similar to this SLRU lookup table,
and it does not try to use more memory than needed. I could not find a
pgbench-visible impact of growing the shared-buffers lookup table - obviously,
because it's not a bottleneck on a regular workload. And it's hard to guess a
representative pathological workload.

In fact, this thread started with a proposal to use a reader-writer lock for multis
(instead of an exclusive lock), and that proposal encountered the same problem. It's
very hard to create a stable reproduction of a pathological workload where this lock
is heavily contended. Many people have observed the problem, but there is still no
public repro.

I bet it will be hard to prove that simplehash is any better than HTAB. But if 
it is really better, shared buffers could benefit from the same technique.

I think it's just fine to use HTAB with a normal size, as long as shared buffers
do so. But there we allocate slightly more space: InitBufTable(NBuffers +
NUM_BUFFER_PARTITIONS). Don't we need to allocate nslots + 1? It seems that we
always do SlruMappingRemove() before SlruMappingAdd(), so it is not necessary.
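
(For illustration only - roughly what the sizing looks like if we follow
InitBufTable(); SlruMappingEntry/SlruMappingHash here are just stand-in
names for whatever the patch actually uses:)

    HASHCTL     info;

    info.keysize = sizeof(int64);               /* SLRU page number */
    info.entrysize = sizeof(SlruMappingEntry);  /* stand-in entry type */

    SlruMappingHash = ShmemInitHash("SLRU Lookup Table",
                                    nslots, nslots, /* no extra headroom */
                                    &info,
                                    HASH_ELEM | HASH_BLOBS);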

Thanks!

Best regards, Andrey Borodin.



Re: Default role -> Predefined role

2021-03-31 Thread Stephen Frost
Greetings,

* Daniel Gustafsson (dan...@yesql.se) wrote:
> > On 20 Nov 2020, at 22:13, Stephen Frost  wrote:
> > Attached is a patch to move from 'default role' terminology to
> > 'predefined role' in the documentation.  In the code, I figured it made
> > more sense to avoid saying either one and instead opted for just
> > 'ROLE_NAME_OF_ROLE' as the #define, which we were very close to (one
> > removing the 'DEFAULT' prefix) but didn't actually entirely follow.
> 
> Predefined is a much better terminology than default for these roles, makes 
> the
> concept a lot clearer.  +1 on this patch.

Alright, now that 3b0c647bbfc52894d979976f1e6d60e40649bba7 has gone in,
we can come back to the topic of renaming default roles to predefined
roles as of v14.

Attached is a patch which does that and adds a section to the obsolete
appendix which should result in the existing links to the default roles
page continuing to work (but going to the page explaining that they've
been renamed to predefined roles) once v14 is released with the update.

Unless there's anything further on this, I'll plan to push in the next
day or so.

Thanks!

Stephen
From 189758e7ff55f45cc6d9ec4427f6b1dd4b5391e6 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Wed, 31 Mar 2021 16:58:27 -0400
Subject: [PATCH] Rename Default Roles to Predefined Roles

The term 'default roles' wasn't quite apt as these roles aren't able to
be modified or removed after installation, so rename them to be
'Predefined Roles' instead, adding an entry into the newly added
Obsolete Appendix to help users of current releases find the new
documentation.

Bruce Momjian and Stephen Frost

Discussion: https://postgr.es/m/157742545062.1149.11052653770497832538%40wrigleys.postgresql.org
and https://www.postgresql.org/message-id/20201120211304.gg16...@tamriel.snowman.net
---
 contrib/adminpack/adminpack.c |  7 --
 contrib/file_fdw/file_fdw.c   |  4 ++--
 .../pg_stat_statements/pg_stat_statements.c   |  2 +-
 contrib/pgrowlocks/pgrowlocks.c   |  2 +-
 .../sgml/appendix-obsolete-default-roles.sgml | 22 +++
 doc/src/sgml/appendix-obsolete.sgml   |  1 +
 doc/src/sgml/file-fdw.sgml|  4 ++--
 doc/src/sgml/filelist.sgml|  1 +
 doc/src/sgml/monitoring.sgml  |  2 +-
 doc/src/sgml/ref/copy.sgml|  2 +-
 doc/src/sgml/user-manag.sgml  | 18 +++
 src/backend/commands/copy.c   |  6 ++---
 src/backend/commands/user.c   |  6 ++---
 src/backend/replication/walreceiver.c |  2 +-
 src/backend/replication/walsender.c   |  2 +-
 src/backend/storage/ipc/procarray.c   |  2 +-
 src/backend/storage/ipc/signalfuncs.c |  2 +-
 src/backend/utils/adt/acl.c   |  4 ++--
 src/backend/utils/adt/dbsize.c|  4 ++--
 src/backend/utils/adt/genfile.c   |  7 --
 src/backend/utils/adt/pgstatfuncs.c   |  2 +-
 src/backend/utils/misc/guc.c  | 14 ++--
 src/include/catalog/pg_authid.dat | 18 +++
 23 files changed, 82 insertions(+), 52 deletions(-)
 create mode 100644 doc/src/sgml/appendix-obsolete-default-roles.sgml

diff --git a/contrib/adminpack/adminpack.c b/contrib/adminpack/adminpack.c
index c3c5e03945..48c1746910 100644
--- a/contrib/adminpack/adminpack.c
+++ b/contrib/adminpack/adminpack.c
@@ -79,10 +79,13 @@ convert_and_check_filename(text *arg)
 	 * files on the server as the PG user, so no need to do any further checks
 	 * here.
 	 */
-	if (is_member_of_role(GetUserId(), DEFAULT_ROLE_WRITE_SERVER_FILES))
+	if (is_member_of_role(GetUserId(), ROLE_PG_WRITE_SERVER_FILES))
 		return filename;
 
-	/* User isn't a member of the default role, so check if it's allowable */
+	/*
+	 * User isn't a member of the pg_write_server_files role, so check if it's
+	 * allowable
+	 */
 	if (is_absolute_path(filename))
 	{
 		/* Disallow '/a/b/data/..' */
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 2059c07349..2c2f149fb0 100644
--- a/contrib/file_fdw/file_fdw.c
+++ b/contrib/file_fdw/file_fdw.c
@@ -269,13 +269,13 @@ file_fdw_validator(PG_FUNCTION_ARGS)
 			 * otherwise there'd still be a security hole.
 			 */
 			if (strcmp(def->defname, "filename") == 0 &&
-!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES))
+!is_member_of_role(GetUserId(), ROLE_PG_READ_SERVER_FILES))
 ereport(ERROR,
 		(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 		 errmsg("only superuser or a member of the pg_read_server_files role may specify the filename option of a file_fdw foreign table")));
 
 			if (strcmp(def->defname, "program") == 0 &&
-!is_member_of_role(GetUserId(), DEFAULT_ROLE_EXECUTE_SERVER_PROGRAM))
+!is_member_of_role(GetUserId(), ROLE_PG_EXECUTE_SERVER_PROGRAM))
 ereport(ERROR,
 		(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
 	

Re: Idea: Avoid JOINs by using path expressions to follow FKs

2021-03-31 Thread Joel Jacobson


On Wed, Mar 31, 2021, at 22:25, Isaac Morland wrote:
> 
> Maybe I have a different proposal in mind than anybody else, but I don't 
> think there is a problem with multiple joins to the same table. If the joins 
> are via the same constraint, then a single join is enough, and if they are 
> via different constraints, the constraints have unique names.
> 
> I think if TA is a table with a foreign key constraint CB to another table 
> TB, then the hypothetical expression:
> 
> TA -> CB
> 
> really just means:
> 
> (select TB from TB where (TB.[primary key columns) = (TA.[source columns of 
> constraint CB]))
> 
> You can then add .fieldname to get the required fieldname. The issue is that 
> writing it this way is hopelessly verbose, but the short form is fine. The 
> query planner also needs to be guaranteed to collapse multiple references 
> through the same constraint to a single actual join (and then take all the 
> multiple fields requested).
> 
> If TA is a table with a foreign key constraint CB to TB, which has a foreign 
> key constraint CC to TC, then this expression:
> 
> TA -> CB -> CC
> 
> just means, by the same definition (except I won't expand it fully, only one 
> level):
> 
> (select TC from TC where (TC.[primary key columns) = ((TA -> CB).[source 
> columns of constraint CC]))
> 
> Which reminds me, I often find myself wanting to write something like a.(f1, 
> f2, f3) = b.(f1, f2, f3) rather than (a.f1, a.f2, a.f3) = (b.f1, b.f2, b.f3). 
> But that's another story

Maybe “anonymous join” would be a good name for this, similar to anonymous 
functions. The joined table(s) would not pollute the namespace.

/Joel

Re: Idea: Avoid JOINs by using path expressions to follow FKs

2021-03-31 Thread Vik Fearing
On 3/31/21 6:54 PM, Pavel Stehule wrote:
>>
>>
>>
>> If using the -> notation, you would only need to manually
>> inspect the tables involved in the remaining JOINs;
>> since you could be confident all uses of -> cannot affect cardinality.
>>
>> I think this would be a win also for an expert SQL consultant working
>> with a new complex data model never seen before.
>>
>>
> I did not feel comfortable when I read about this proprietary extension of
> SQL.  I can accept and it can be nice to support ANSI/SQL object's
> referentions, but implementing own syntax for JOIN looks too strange. I
> don't see too strong benefit in inventing new syntax and increasing the
> complexity and possible disorientation of users about correct syntax. Some
> users didn't adopt a difference between old joins and modern joins, and you
> are inventing a third syntax.

I'm with you on this: let's do it the Standard way, or not do it at all.
-- 
Vik Fearing




Re: What to call an executor node which lazily caches tuples in a hash table?

2021-03-31 Thread Andres Freund
Hi,

On 2021-03-31 12:29:36 +1300, David Rowley wrote:
> Here's a list of a few that were mentioned:
> 
> Probe Cache
> Tuple Cache
> Keyed Materialize
> Hash Materialize
> Result Cache
> Cache
> Hash Cache
> Lazy Hash
> Reactive Hash
> Parameterized Hash
> Parameterized Cache
> Keyed Inner Cache
> MRU Cache
> MRU Hash

Suggestion: ParamKeyedCache.

I don't like Parameterized because it makes it sound like the cache is
parameterized (e.g. size), rather than using Param values as the keys
for the cache.  ParamKeyed indicates that Params are the key, rather
than configuring the cache...

I don't like "result cache" all that much, because it does sound like
we'd be caching query results etc, or that it might be referring to
Result nodes. Neither of which is the case...

Greetings,

Andres Freund




Re: libpq debug log

2021-03-31 Thread 'alvhe...@alvh.no-ip.org'
On 2021-Mar-31, Tom Lane wrote:

> I think this is a timing problem that's triggered (on some machines)
> by force_parallel_mode = regress.  Looking at spurfowl's latest
> failure of this type, the postmaster log shows
> 
> 2021-03-31 14:34:54.982 EDT [18233:15] 001_libpq_pipeline.pl LOG:  execute 
> <unnamed>: SELECT 1.0/g FROM generate_series(3, -1, -1) g
> 2021-03-31 14:34:54.992 EDT [18234:1] ERROR:  division by zero
> 2021-03-31 14:34:54.992 EDT [18234:2] STATEMENT:  SELECT 1.0/g FROM 
> generate_series(3, -1, -1) g
> 2021-03-31 14:34:54.993 EDT [18233:16] 001_libpq_pipeline.pl ERROR:  division 
> by zero
> 2021-03-31 14:34:54.993 EDT [18233:17] 001_libpq_pipeline.pl STATEMENT:  
> SELECT 1.0/g FROM generate_series(3, -1, -1) g
> 2021-03-31 14:34:54.995 EDT [18216:4] LOG:  background worker "parallel 
> worker" (PID 18234) exited with exit code 1
> 2021-03-31 14:34:54.995 EDT [18233:18] 001_libpq_pipeline.pl LOG:  could not 
> send data to client: Broken pipe
> 2021-03-31 14:34:54.995 EDT [18233:19] 001_libpq_pipeline.pl FATAL:  
> connection to client lost
> 
> We can see that the division by zero occurred in a parallel worker.
> My theory is that in parallel mode, it's uncertain whether the
> error will be reported before or after the "preceding" successful
> output rows.  So you need to disable parallelism to make this
> test case stable.

Ooh, that makes sense, thanks.  Added a SET in the program itself to set
it to off.

This is not the *only* issue though; at least animal drongo shows a
completely different failure, where the last few tests don't even get to
run, and the server log just has this:

2021-03-31 20:20:14.460 UTC [178364:4] 001_libpq_pipeline.pl LOG:  
disconnection: session time: 0:00:00.469 user=pgrunner database=postgres 
host=127.0.0.1 port=52638
2021-03-31 20:20:14.681 UTC [178696:1] [unknown] LOG:  connection received: 
host=127.0.0.1 port=52639
2021-03-31 20:20:14.687 UTC [178696:2] [unknown] LOG:  connection authorized: 
user=pgrunner database=postgres application_name=001_libpq_pipeline.pl
2021-03-31 20:20:15.157 UTC [178696:3] 001_libpq_pipeline.pl LOG:  could not 
receive data from client: An existing connection was forcibly closed by the 
remote host.

I suppose the program is crashing for some reason.

-- 
Álvaro Herrera   Valdivia, Chile
"Los dioses no protegen a los insensatos.  Éstos reciben protección de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)




Re: libpq debug log

2021-03-31 Thread Tom Lane
"'alvhe...@alvh.no-ip.org'"  writes:
> This is not the *only* issue though; at least animal drongo shows a
> completely different failure, where the last few tests don't even get to
> run, and the server log just has this:

That is weird - only test 4 (of 8) runs at all, the rest seem to
fail to connect.  What's different about pipelined_insert?

regards, tom lane




Re: libpq debug log

2021-03-31 Thread Tom Lane
I wrote:
> That is weird - only test 4 (of 8) runs at all, the rest seem to
> fail to connect.  What's different about pipelined_insert?

Oh ... there's a pretty obvious theory.  pipelined_insert is
the only one that is not asked to write a trace file.
So for some reason, opening the trace file fails.
(I wonder why we don't see an error message for this though.)

regards, tom lane




Re: libpq debug log

2021-03-31 Thread 'alvhe...@alvh.no-ip.org'
On 2021-Mar-31, Tom Lane wrote:

> I wrote:
> > That is weird - only test 4 (of 8) runs at all, the rest seem to
> > fail to connect.  What's different about pipelined_insert?
> 
> Oh ... there's a pretty obvious theory.  pipelined_insert is
> the only one that is not asked to write a trace file.
> So for some reason, opening the trace file fails.
> (I wonder why we don't see an error message for this though.)

.. oh, I think we forgot to set conn->Pfdebug = NULL when creating the
connection.  So when we do PQtrace(), the first thing it does is
PQuntrace(), and then that tries to do fflush(conn->Pfdebug) ---> crash.
So this should fix it.

-- 
Álvaro Herrera   Valdivia, Chile
 really, I see PHP as like a strange amalgamation of C, Perl, Shell
 inflex: you know that "amalgam" means "mixture with mercury",
   more or less, right?
 i.e., "deadly poison"
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index a90bdb8ab6..56a8266bc3 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -3952,6 +3952,7 @@ makeEmptyPGconn(void)
 	conn->verbosity = PQERRORS_DEFAULT;
 	conn->show_context = PQSHOW_CONTEXT_ERRORS;
 	conn->sock = PGINVALID_SOCKET;
+	conn->Pfdebug = NULL;
 
 	/*
 	 * We try to send at least 8K at a time, which is the usual size of pipe


Re: libpq debug log

2021-03-31 Thread 'alvhe...@alvh.no-ip.org'
On 2021-Mar-31, 'alvhe...@alvh.no-ip.org' wrote:

> .. oh, I think we forgot to set conn->Pfdebug = NULL when creating the
> connection.  So when we do PQtrace(), the first thing it does is
> PQuntrace(), and then that tries to do fflush(conn->Pfdebug) ---> crash.
> So this should fix it.

I tried to use MALLOC_PERTURB_ to prove this theory, but I failed at it.
I'll just push this blind and see what happens.

-- 
Álvaro Herrera39°49'30"S 73°17'W
"Someone said that it is at least an order of magnitude more work to do
production software than a prototype. I think he is wrong by at least
an order of magnitude."  (Brian Kernighan)




Re: Support for NSS as a libpq TLS backend

2021-03-31 Thread Jacob Champion
On Fri, 2021-03-26 at 18:05 -0400, Stephen Frost wrote:
> * Jacob Champion (pchamp...@vmware.com) wrote:
> > Yeah. I was hoping to avoid implementing our own locks and refcounts,
> > but it seems like it's going to be required.
> 
> Yeah, afraid so.

I think it gets worse, after having debugged some confusing crashes.
There's already been a bit of discussion about PR_Init upthread:

> Once we settle on a version we can confirm if PR_Init is/isn't needed and
> remove all traces of it if not.

What the NSPR documentation omits is that implicit initialization is
not threadsafe. So NSS_InitContext() is technically "threadsafe"
because it's built on PR_CallOnce(), but if you haven't called
PR_Init() yet, multiple simultaneous PR_CallOnce() calls can crash into
each other.

So, fine. We just add our own locks around NSS_InitContext() (or around
a single call to PR_Init()). Well, the first thread to win and
successfully initialize NSPR gets marked as the "primordial" thread
using thread-local state. And it gets a pthread destructor that does...
something. So lazy initialization seems a bit dangerous regardless of
whether or not we add locks, but I can't really prove whether it's
dangerous or not in practice.
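
For concreteness, the kind of guard I mean is roughly this (the mutex and
flag are ours, not NSPR's; only PR_Init() and its arguments are real NSPR
API, and the header name may vary by packaging):

	#include <stdbool.h>
	#include <pthread.h>
	#include <prinit.h>		/* PR_Init, PR_USER_THREAD, PR_PRIORITY_NORMAL */

	static pthread_mutex_t nspr_init_lock = PTHREAD_MUTEX_INITIALIZER;
	static bool nspr_initialized = false;

	static void
	pq_ensure_nspr_init(void)
	{
		pthread_mutex_lock(&nspr_init_lock);
		if (!nspr_initialized)
		{
			/* Call PR_Init() explicitly, so that later NSS_InitContext()
			 * calls never race through implicit PR_CallOnce() init. */
			PR_Init(PR_USER_THREAD, PR_PRIORITY_NORMAL, 0);
			nspr_initialized = true;
		}
		pthread_mutex_unlock(&nspr_init_lock);
	}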

I do know that only the primordial thread is allowed to call
PR_Cleanup(), and of course we wouldn't be able to control which thread
does what for libpq clients. I don't know what other assumptions are
made about the primordial thread, or if there are any platform-specific 
behaviors with older versions of NSPR that we'd need to worry about. It
used to be that the primordial thread was not allowed to exit before
any other threads, but that restriction was lifted at some point [1].

I think we're going to need some analogue to PQinitOpenSSL() to help
client applications cut through the mess, but I'm not sure what it
should look like, or how we would maintain any sort of API
compatibility between the two flavors. And does libpq already have some
notion of a "main thread" that I'm missing?

--Jacob

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=294955


Re: libpq debug log

2021-03-31 Thread Tom Lane
"'alvhe...@alvh.no-ip.org'"  writes:
> On 2021-Mar-31, Tom Lane wrote:
>> So for some reason, opening the trace file fails.
>> (I wonder why we don't see an error message for this though.)

> .. oh, I think we forgot to set conn->Pfdebug = NULL when creating the
> connection.  So when we do PQtrace(), the first thing it does is
> PQuntrace(), and then that tries to do fflush(conn->Pfdebug) ---> crash.
> So this should fix it.

Nope, see the MemSet a few lines further up.  This change seems like
good style, but it won't fix anything --- else we'd have been
having issues with the pre-existing trace logic.
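
(From memory, the allocation path in makeEmptyPGconn() already looks roughly
like this, so Pfdebug starts out as NULL anyway:)

	conn = (PGconn *) malloc(sizeof(PGconn));
	if (conn == NULL)
		return conn;

	/* Zero all pointers and booleans */
	MemSet(conn, 0, sizeof(PGconn));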

What I suspect is some Windows dependency in the way that
001_libpq_pipeline.pl is setting up the trace output files.
cc'ing Andrew to see if he has any ideas.

regards, tom lane




Re: What to call an executor node which lazily caches tuples in a hash table?

2021-03-31 Thread Adam Brusselback
> Does anyone else like the name "Tuple Cache"?
I personally like that name best.

It makes sense to me when thinking about looking at an EXPLAIN and trying
to understand why this node may be there. The way we look up the value
stored in the cache doesn't really matter to me as a user; I'm thinking more
about the reason the node is there in the first place, which Tuple
Cache conveys... at least for me.


Re: Crash in BRIN minmax-multi indexes

2021-03-31 Thread Tomas Vondra
Hi,

I think I found the issue - it's kinda obvious, really. We need to
consider the timezone, because the "time" parts alone may be sorted
differently. The attached patch should fix this, and it also fixes a
similar issue in the inet data type.
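
To illustrate with made-up values (the zone is stored in seconds west of UTC,
so +02 becomes -7200 and -05 becomes +18000): '11:00:00+02' has the larger
time part but the smaller UTC-equivalent time, so a distance taken from the
time parts alone can come out negative and trip the Assert. The arithmetic
mirrors the patch:

	int64	ta_time = INT64CONST(11) * 3600 * USECS_PER_SEC;	/* 11:00:00+02 */
	int32	ta_zone = -7200;
	int64	tb_time = (INT64CONST(10) * 3600 + 30 * 60) * USECS_PER_SEC;	/* 10:30:00-05 */
	int32	tb_zone = 18000;

	/* time parts alone: negative "delta" */
	Assert(tb_time - ta_time < 0);
	/* with the zone folded in, the delta is positive again */
	Assert((tb_time + (int64) tb_zone * USECS_PER_SEC) -
		   (ta_time + (int64) ta_zone * USECS_PER_SEC) > 0);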

As for why the regression tests did not catch this, it's most likely
because the data is likely generated in "nice" ordering, or something
like that. I'll see if I can tweak the ordering to trigger these issues
reliably, and I'll do a bit more randomized testing.

There's also the question of rounding errors, which I think might cause
random assert failures (but in practice it's harmless, in the worst case
we'll merge the ranges a bit differently).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/brin/brin_minmax_multi.c b/src/backend/access/brin/brin_minmax_multi.c
index 70109960e8..439e0b414c 100644
--- a/src/backend/access/brin/brin_minmax_multi.c
+++ b/src/backend/access/brin/brin_minmax_multi.c
@@ -2090,7 +2090,7 @@ brin_minmax_multi_distance_timetz(PG_FUNCTION_ARGS)
 	TimeTzADT  *ta = PG_GETARG_TIMETZADT_P(0);
 	TimeTzADT  *tb = PG_GETARG_TIMETZADT_P(1);
 
-	delta = tb->time - ta->time;
+	delta = (tb->time + tb->zone * USECS_PER_SEC) - (ta->time + ta->zone * USECS_PER_SEC);
 
 	Assert(delta >= 0);
 
@@ -2292,6 +2292,9 @@ brin_minmax_multi_distance_inet(PG_FUNCTION_ARGS)
 	inet	   *ipa = PG_GETARG_INET_PP(0);
 	inet	   *ipb = PG_GETARG_INET_PP(1);
 
+	int			lena,
+				lenb;
+
 	/*
 	 * If the addresses are from different families, consider them to be in
 	 * maximal possible distance (which is 1.0).
@@ -2299,20 +2302,38 @@ brin_minmax_multi_distance_inet(PG_FUNCTION_ARGS)
 	if (ip_family(ipa) != ip_family(ipb))
 		PG_RETURN_FLOAT8(1.0);
 
-	/* ipv4 or ipv6 */
-	if (ip_family(ipa) == PGSQL_AF_INET)
-		len = 4;
-	else
-		len = 16;			/* NS_IN6ADDRSZ */
-
 	addra = ip_addr(ipa);
 	addrb = ip_addr(ipb);
 
+	/*
+	 * The length is calculated from the mask length, because we sort the
+	 * addresses by first address in the range, so A.B.C.D/24 < A.B.C.1
+	 * (the first range starts at A.B.C.0, which is before A.B.C.1). We
+	 * don't want to produce negative delta in this case, so we just cut
+	 * the extra bytes.
+	 *
+	 * XXX Maybe this should be a bit more careful and cut the bits, not
+	 * just whole bytes.
+	 */
+	lena = ip_bits(ipa) / 8;
+	lenb = ip_bits(ipb) / 8;
+
+	len = Max(lena, lenb);
+
 	delta = 0;
 	for (i = len - 1; i >= 0; i--)
 	{
-		delta += (float8) addrb[i] - (float8) addra[i];
-		delta /= 256;
+		unsigned char a = addra[i];
+		unsigned char b = addrb[i];
+
+		if (i >= lena)
+			a = 0;
+
+		if (i >= lenb)
+			b = 0;
+
+		delta += (float8) b - (float8) a;
+		delta /= 255;
 	}
 
 	Assert((delta >= 0) && (delta <= 1));


Re: Assertion failure with barriers in parallel hash join

2021-03-31 Thread Melanie Plageman
On Wed, Mar 17, 2021 at 8:18 AM Thomas Munro  wrote:
>
> On Wed, Mar 17, 2021 at 6:58 PM Thomas Munro  wrote:
> > According to BF animal elver there is something wrong with this
> > commit.  Looking into it.
>
> Assertion failure reproduced here and understood, but unfortunately
> it'll take some more time to fix this.  I've reverted the commit for
> now to unbreak the ~5 machines that are currently showing red in the
> build farm.

So, I think the premise of the patch
v6-0001-Fix-race-condition-in-parallel-hash-join-batch-cl.patch makes
sense: freeing the hashtable memory should happen atomically with
updating the flag that indicates to all others that the memory is not to
be accessed any longer.

As was likely reported by the buildfarm leading to you reverting the
patch, when the inner side is empty and we dump out before advancing the
build barrier past PHJ_BUILD_HASHING_OUTER, we trip the new Asserts
you've added in ExecHashTableDetach().

Assert(!pstate ||
BarrierPhase(&pstate->build_barrier) >= PHJ_BUILD_RUNNING);

Hmm.

Maybe if the inner side is empty, we can advance the build barrier to
the end?
We help batch 0 along like this in ExecParallelHashJoinSetUpBatches().

But, I'm not sure we can expect the process executing this code to be
attached to the build barrier, can we?

@@ -296,7 +304,19 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				 * outer relation.
 				 */
 				if (hashtable->totalTuples == 0 &&
 					!HJ_FILL_OUTER(node))
+				{
+					if (parallel)
+					{
+						Barrier    *build_barrier;
+
+						build_barrier = &parallel_state->build_barrier;
+						while (BarrierPhase(build_barrier) < PHJ_BUILD_DONE)
+							BarrierArriveAndWait(build_barrier, 0);
+						BarrierDetach(build_barrier);
+					}
+
 					return NULL;
+				}




Re: New IndexAM API controlling index vacuum strategies

2021-03-31 Thread Peter Geoghegan
On Wed, Mar 31, 2021 at 9:29 AM Robert Haas  wrote:
> I can't effectively review 0001 because it both changes the code for
> individual functions significantly and reorders them within the file.
> I think it needs to be separated into two patches, one of which makes
> the changes and the other of which reorders stuff. I would probably
> vote for just dropping the second one, since I'm not sure there's
> really enough value there to justify the code churn, but if we're
> going to do it, I think it should definitely be done separately.

Thanks for the review!

I'll split it up that way. I think that I need to see it both ways
before deciding if I should push back on that. I will admit that I was
a bit zealous in rearranging things because it seems long overdue. But
I might well have gone too far with rearranging code.

> * "onerel" is a stupid naming convention that I'd rather not propagate
> further. It makes sense in the context of a function whose job it is
> to iterate over a list of relations and do something for each one. But
> once you're down into the code that only knows about one relation in
> the first place, calling that relation "onerel" rather than "rel" or
> "vacrel" or even "heaprel" is just confusing. Ditto for "onerelid".

I agree, and can change it. Though at the cost of more diff churn.

> * Moving stuff from static variables into LVRelState seems like a
> great idea. Renaming it from LVRelStats seems like a good idea, too.

The static variables were bad, but nowhere near as bad as the
variables that are local to lazy_scan_heap(). They are currently a
gigantic mess.

Not that LVRelStats was much better. We have the latestRemovedXid
field in LVRelStats, and dead_tuples, but *don't* have a bunch of
things that really are stats (it seems to be quite random). Calling
the struct LVRelStats was always dubious.

> * Setting vacrel->lps = NULL "for now" when we already did palloc0 at
> allocation time seems counterproductive.

Okay, will fix.

> * The code associated with the comment block that says "Initialize
> state for a parallel vacuum" has been moved inside lazy_space_alloc().
> That doesn't seem like an especially good choice, because no casual
> reader is going to expect a function called lazy_space_alloc() to be
> entering parallel mode and so forth as a side effect. Also, the call
> to lazy_space_alloc() still has a comment that says "Allocate the
> space for dead tuples in case parallel vacuum is not initialized."
> even though the ParallelVacuumIsActive() check has been removed and
> the function now does a lot more than allocating space.

Will fix.

> * lazy_scan_heap() removes the comment which begins "Note that
> vacrelstats->dead_tuples could have tuples which became dead after
> HOT-pruning but are not marked dead yet." But IIUC that special case
> is removed by a later patch, not 0001, in which case it is that patch
> that should be touching this comment.

Will fix.

> Regarding 0002:
>
> * It took me a while to understand why lazy_scan_new_page() and
> lazy_scan_empty_page() are named the way they are. I'm not sure
> exactly what would be better, so I am not necessarily saying I think
> you have to change anything, but for the record I think this naming
> sucks.

I agree -- it's dreadful.

> The reason we have "lazy" in here, AFAIU, is because originally
> we only had old-style VACUUM FULL, and that was the good hard-working
> VACUUM, and what we now think of as VACUUM was the "lazy" version that
> didn't really do the whole job. Then we decided it was the
> hard-working version that actually sucked and we always wanted to be
> lazy (or else rewrite the table). So now we have all of these
> functions named "lazy" which are really just functions to do "vacuum".

FWIW I always thought that the terminology was lifted from the world
of garbage collection. There is a thing called a lazy sweep algorithm.
Isn't vacuuming very much like sweeping? There are also mark-sweep
garbage collection algorithms that take two passes, one-phase
variants, etc.

In general the world of garbage collection has some ideas that might
be worth pilfering for ideas. It's not all that relevant to our world,
and a lot of it is totally irrelevant, but there is enough overlap for
it to interest me. Though GC is such a vast and complicated world that
it's difficult to know where to begin. I own a copy of the book
"Garbage Collection: Algorithms for Automatic Dynamic Memory
Management". Most of it goes over my head, but I have a feeling that
I'll get my $20 worth at some point.

> If, for example, we could agree that the whole thing is vacuum and the first
> time we touch the heap page that's the strawberry phase and then the
> second time we touch it that's the rhubarb phase, then we could have
> vacuum_strawberry_page(), vacuum_strawberry_new_page(),
> vacuum_rhubarb_phase(), etc. and everything would be a lot clearer,
> assuming that you replaced the words "strawberry" and "rhubarb" with
> something actually

Re: libpq debug log

2021-03-31 Thread Tom Lane
I wrote:
> What I suspect is some Windows dependency in the way that
> 001_libpq_pipeline.pl is setting up the trace output files.

While this may have little to do with drongo's issue,
I'm going to take exception to this bit that I see that
the patch added to PQtrace():

/* Make the trace stream line-buffered */
setvbuf(debug_port, NULL, _IOLBF, 0);

I do not like that on two grounds:

1. The trace file handle belongs to the calling application,
not to libpq.  I do not think libpq has any business editorializing
on the file's settings.

2. POSIX says that setvbuf() must be done before any other
operation on the file.  Since we have no way to know whether
any output has already been written to the trace file, the
above is a spec violation.
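
If line buffering is wanted, the calling application can trivially set it up
itself, before it writes anything to the stream (sketch):

	FILE	   *trace = fopen("libpq.trace", "w");

	if (trace != NULL)
	{
		/* fine here: nothing has been written to the stream yet */
		setvbuf(trace, NULL, _IOLBF, 0);
		PQtrace(conn, trace);
	}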


... and as for an actual explanation for drongo's issue,
I'm sort of wondering why libpq_pipeline.c is opening the
trace file in "w+" mode rather than vanilla "w" mode.
That seems unnecessary, and maybe it doesn't work so well
on Windows.

regards, tom lane




Re: Assertion failure with barriers in parallel hash join

2021-03-31 Thread Melanie Plageman
On Wed, Mar 17, 2021 at 8:18 AM Thomas Munro  wrote:
>
> On Wed, Mar 17, 2021 at 6:58 PM Thomas Munro  wrote:
> > According to BF animal elver there is something wrong with this
> > commit.  Looking into it.
>
> Assertion failure reproduced here and understood, but unfortunately
> it'll take some more time to fix this.  I've reverted the commit for
> now to unbreak the ~5 machines that are currently showing red in the
> build farm.

Also, silly question: why couldn't we just set the pstate->batches pointer to
InvalidDsaPointer before doing dsa_free() (of course saving the pointer
so that we can actually do the freeing with it)? Is there still a race?
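
I mean something along these lines (just a sketch of the ordering, not
tested):

	dsa_pointer batches = pstate->batches;

	/* publish "no batches" before the memory actually goes away */
	pstate->batches = InvalidDsaPointer;
	dsa_free(hashtable->area, batches);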




Re: Crash in BRIN minmax-multi indexes

2021-03-31 Thread Zhihong Yu
Hi,
For inet data type fix:

+   unsigned char a = addra[i];
+   unsigned char b = addrb[i];
+
+   if (i >= lena)
+   a = 0;
+
+   if (i >= lenb)
+   b = 0;

Should the length check precede the addra[i] ?
Something like:

   unsigned char a;
   if (i >= lena) a = 0;
   else a = addra[i];

Cheers

On Wed, Mar 31, 2021 at 3:25 PM Tomas Vondra 
wrote:

> Hi,
>
> I think I found the issue - it's kinda obvious, really. We need to
> consider the timezone, because the "time" parts alone may be sorted
> differently. The attached patch should fix this, and it also fixes a
> similar issue in the inet data type.
>
> As for why the regression tests did not catch this, it's most likely
> because the data is likely generated in "nice" ordering, or something
> like that. I'll see if I can tweak the ordering to trigger these issues
> reliably, and I'll do a bit more randomized testing.
>
> There's also the question of rounding errors, which I think might cause
> random assert failures (but in practice it's harmless, in the worst case
> we'll merge the ranges a bit differently).
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: Crash in BRIN minmax-multi indexes

2021-03-31 Thread Tomas Vondra
On 4/1/21 12:53 AM, Zhihong Yu wrote:
> Hi,
> For inet data type fix:
> 
> +       unsigned char a = addra[i];
> +       unsigned char b = addrb[i];
> +
> +       if (i >= lena)
> +           a = 0;
> +
> +       if (i >= lenb)
> +           b = 0;
> 
> Should the length check precede the addra[i] ?
> Something like:
> 
>        unsigned char a;
>        if (i >= lena) a = 0;
>        else a = addra[i];
> 

I don't think that makes any difference. We know the bytes are there, we
just want to ignore / reset them in some cases.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Crash in BRIN minmax-multi indexes

2021-03-31 Thread Jaime Casanova
On Wed, Mar 31, 2021 at 5:25 PM Tomas Vondra
 wrote:
>
> Hi,
>
> I think I found the issue - it's kinda obvious, really. We need to
> consider the timezone, because the "time" parts alone may be sorted
> differently. The attached patch should fix this, and it also fixes a
> similar issue in the inet data type.
>

ah! yeah! obvious... if you say so ;)

> As for why the regression tests did not catch this, it's most likely
> because the data is likely generated in "nice" ordering, or something
> like that. I'll see if I can tweak the ordering to trigger these issues
> reliably, and I'll do a bit more randomized testing.
>
> There's also the question of rounding errors, which I think might cause
> random assert failures (but in practice it's harmless, in the worst case
> we'll merge the ranges a bit differently).
>
>

I can confirm this fixes the crash in the query I showed and the original case.

-- 
Jaime Casanova
Director de Servicios Profesionales
SYSTEMGUARDS - Consultores de PostgreSQL



