Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread Amit Langote
On 2019/04/23 14:45, David Rowley wrote:
> On Tue, 23 Apr 2019 at 13:49, Amit Langote
>  wrote:
>>
>> On 2019/04/23 7:51, Alvaro Herrera wrote:
>>> To me, it sounds
>>> unintuitive to accept partitions that don't exactly match the order of
>>> the parent table; but it's been supported all along.
>>
>> You might know it already, but even though column sets of two tables may
>> appear identical, their TupleDescs still may not match due to dropped
>> columns being different in the two tables.
> 
> I think that's the most likely reason that the TupleDescs would differ
> at all.  For RANGE partitions on time series data, it's quite likely
> that new partitions are periodically created to store new data.  If
> the partitioned table those partitions belong to has evolved over time,
> gaining new columns and dropping columns that are no longer needed, then
> some translation work will end up being required.  From my work on
> 42f70cd9c, I know tuple conversion is not free, so it's pretty good
> that pg_dump will remove the need for maps in this case even with the
> proposed change.

Maybe I'm missing something, but if you're talking about the pg_dump changes
proposed in the latest patch that Alvaro posted on April 18, which is to
emit partitions in two steps, then I don't see how that will always
improve things in terms of whether maps are needed or not (regardless of
whether that's something to optimize for or not).  If partitions needed a
map in the old database, this patch means that they will *continue* to
need it in the new database.  With HEAD, they won't, because partitions
created with CREATE TABLE PARTITION OF will have the same descriptor as
the parent, provided the parent is also created afresh in the new database,
which is true in the non-binary-upgrade mode.  The current arrangement, as
I mentioned in my previous email, is partly inspired by the fact that
creating the parent and partition afresh in the new database leads them
to have the same TupleDesc and hence not need maps.
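
To illustrate with a minimal sketch (table names are made up), the difference
between the two forms of dump output we're discussing is:

CREATE TABLE p (a int, b int) PARTITION BY LIST (a);

-- what pg_dump emits today outside --binary-upgrade: the partition is
-- created afresh from the parent, so both end up with the same TupleDesc
-- and no conversion map is needed
CREATE TABLE p1 PARTITION OF p FOR VALUES IN (1);

-- what the proposed two-step output would emit instead: the partition
-- keeps its own column order (and dropped-column history), so a map that
-- was needed in the old database is still needed after the restore
CREATE TABLE p2 (b int, a int);
ALTER TABLE p ATTACH PARTITION p2 FOR VALUES IN (2);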

Thanks,
Amit





Re: Regression test PANICs with master-standby setup on same machine

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 15:52:59 +0530, Kuntal Ghosh wrote:
> Hello hackers,
> 
> With a master-standby setup configured on the same machine, I'm
> getting a panic in tablespace test while running make installcheck.
> 
> 1. CREATE TABLESPACE regress_tblspacewith LOCATION 'blah';
> 2. DROP TABLESPACE regress_tblspacewith;
> 3. CREATE TABLESPACE regress_tblspace LOCATION 'blah';
> -- do some operations in this tablespace
> 4. DROP TABLESPACE regress_tblspace;
> 
> The master panics at the last statement when standby has completed
> applying the WAL up to step 2 but hasn't started step 3.
> PANIC:  could not fsync file
> "pg_tblspc/16387/PG_12_201904072/16384/16446": No such file or
> directory
> 
> The reason is that both tablespaces point to the same location. When the
> master tries to delete the new tablespace (and requests a checkpoint),
> the corresponding folder has already been deleted by the standby while
> applying the WAL that deletes the old tablespace. I'm able to reproduce
> the issue with the attached script.
> 
> sh standby-server-setup.sh
> make installcheck
> 
> I accept that configuring master-standby on the same machine for this
> test is not okay. But, can we avoid the PANIC somehow? Or, is this
> intentional and I should not include testtablespace in this case?

FWIW, I think the right fix for this is to simply drop the requirement
that tablespace paths need to be absolute. It's not buying us anything,
it's just making things more complicated. We should just do a simple
check against the tablespace being inside PGDATA, and leave it at
that. Yes, that can be tricked, but so can the current system.

That'd make things easier both for the regression tests and for operations.
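
As a purely hypothetical sketch of what that would allow (relative locations
are rejected today, so this is not valid syntax yet):

-- hypothetical: a location relative to the data directory, which would be
-- handy for the regression tests and for running several nodes on one host
CREATE TABLESPACE regress_tblspace LOCATION 'testtablespace';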

Greetings,

Andres Freund




Re: Regression test PANICs with master-standby setup on same machine

2019-04-22 Thread Michael Paquier
On Tue, Apr 23, 2019 at 01:33:39PM +0900, Kyotaro HORIGUCHI wrote:
> I think this is rather a testing or debugging feature. This can
> be applied to all paths, so the variable might be "path_prefix" or
> something more generic than tablespace_chroot.
> 
> Does it make sense?

A GUC which enforces a path for object creation does not sound like a
good idea to me, and what you propose would still bite back; for
example, two local nodes could use the same port but different Unix
socket paths.
--
Michael




Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread David Rowley
On Tue, 23 Apr 2019 at 13:49, Amit Langote
 wrote:
>
> On 2019/04/23 7:51, Alvaro Herrera wrote:
> > To me, it sounds
> > unintuitive to accept partitions that don't exactly match the order of
> > the parent table; but it's been supported all along.
>
> You might know it already, but even though column sets of two tables may
> appear identical, their TupleDescs still may not match due to dropped
> columns being different in the two tables.

I think that's the most likely reason that the TupleDescs would differ
at all.  For RANGE partitions on time series data, it's quite likely
that new partitions are periodically created to store new data.  If
the partitioned table those partitions belong to has evolved over time,
gaining new columns and dropping columns that are no longer needed, then
some translation work will end up being required.  From my work on
42f70cd9c, I know tuple conversion is not free, so it's pretty good
that pg_dump will remove the need for maps in this case even with the
proposed change.

I imagine users specifying columns for partitions in various random
orders is a less likely scenario, although it's entirely possible.

Tom's point about being able to pg_dump a single partition and restore
it somewhere without the parent (apart from an error in ATTACH
PARTITION) seems like it could be useful too, so I'd say that the
pg_dump change is a good one regardless.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2019-04-22 Thread Paul Guo
Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
create redo error, but I suspect that some other kind of redo, one which
depends on the files under the directory (they are not copied since the
directory is not created) and which also cannot be covered by the invalid
page mechanism, could still fail. Thanks.

On Mon, Apr 22, 2019 at 3:40 PM Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> Oops! The comment in the previous patch is wrong.
>
> At Mon, 22 Apr 2019 16:15:13 +0900 (Tokyo Standard Time), Kyotaro
> HORIGUCHI  wrote in <
> 20190422.161513.258021727.horiguchi.kyot...@lab.ntt.co.jp>
> > At Mon, 22 Apr 2019 12:36:43 +0800, Paul Guo  wrote in
> 
> > > Please see my replies inline. Thanks.
> > >
> > > On Fri, Apr 19, 2019 at 12:38 PM Asim R P  wrote:
> > >
> > > > On Wed, Apr 17, 2019 at 1:27 PM Paul Guo  wrote:
> > > > >
> > > > > create db with tablespace
> > > > > drop database
> > > > > drop tablespace.
> > > >
> > > > Essentially, that sequence of operations causes crash recovery to
> > > > fail
> > > > if the "drop tablespace" transaction was committed before crashing.
> > > > This is a bug in crash recovery in general and should be reproducible
> > > > without configuring a standby.  Is that right?
> > > >
> > >
> > > No. In general, a checkpoint is done for
> > > drop_db/create_db/drop_tablespace on the master.
> > > That makes the file/directory up to date, if I understand the
> > > related code correctly.
> > > For the standby, checkpoint redo does not ensure that.
> >
> > That's partly right. As you must have seen, a fast shutdown forces a
> > restartpoint for the last checkpoint, which prevents the problem
> > from happening. Anyway, it seems to be a problem.
> >
> > > > Your patch creates missing directories in the destination.  Don't we
> > > > need to create the tablespace symlink under pg_tblspc/?  I would
> > > >
> > >
> > > The 'create db with tablespace' redo log does not include the tablespace's
> > > real directory information.
> > > Yes, we could add it into the xlog, but that seems to be overdesign.
> >
> > But I don't think creating a directory that is to be removed just
> > afterwards is the solution we want. The directory is most likely to be
> > removed just afterwards.
> >
> > > > prefer extending the invalid page mechanism to deal with this, as
> > > > suggested by Ashwin off-list.  It will allow us to avoid creating
> > > > directories and files only to remove them shortly afterwards when the
> > > > drop database and drop tablespace records are replayed.
> > > >
> > > >
> > > The 'invalid page' mechanism seems more appropriate for missing pages of
> > > a file. For missing directories, we could, of course, hack around it and
> > > use that (e.g. reading any page of a relfile in that database) to make
> > > the tablespace create code (without the symlink) safer (it assumes those
> > > directories will be deleted soon).
> > >
> > > More feedback about all of the previous discussed solutions is welcome.
> >
> > It doesn't seem to me that the invalid page mechanism is
> > applicable in a straightforward way, because it doesn't consider
> > simple file copies.
> >
> > A drop failure is ignored at any time. I suppose we can ignore the
> > error and continue recovering as long as recovery has not reached
> > consistency. The attached works for *at least* your case, but I
> > haven't checked whether it covers all places that need the same
> > treatment.
>
> The comment for the new function XLogMakePGDirectory is wrong:
>
> + * There is a possibility that WAL replay causes a creation of the same
> + * directory left by the previous crash. Issuing ERROR prevents the caller
> + * from continuing recovery.
>
> The correct one is:
>
> + * There is a possibility that WAL replay causes an error by creation of a
> + * directory under a directory removed before the previous crash. Issuing
> + * ERROR prevents the caller from continuing recovery.
>
> It is fixed in the attached.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>
>


Re: Symbol referencing errors

2019-04-22 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 >> When I compile PostgreSQL-11.2 on SmartOS, I find the following errors:
 >> ...
 >> ld: warning: symbol referencing errors

 Tom> Yeah, our SmartOS buildfarm members show those warnings too, eg

 Tom> 
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=damselfly&dt=2019-04-22%2010%3A00%3A15&stg=make-contrib

 Tom> AFAICT they're harmless, so my advice is just ignore them.

 Tom> If you're sufficiently annoyed by them to find the cause
 Tom> and try to fix it, go ahead, but I haven't heard anyone
 Tom> else worried about it.  It might be that SmartOS wants
 Tom> something like what we have to do on macOS and AIX,
 Tom> ie provide the core postgres executable in some sort of
 Tom> linker switch while linking shlibs that will be loaded
 Tom> by that executable.

I wonder if it's the use of -Bsymbolic that causes this (buildfarm logs
don't seem to go back far enough to check). (Note to original poster:
-Bsymbolic is there for a reason, you can't just remove it - but see
below.)

Since this is an ELF platform - arguably the closest thing to the
original reference ELF platform, at least by descent - it should not
require the kinds of tricks used on macOS and AIX; but we haven't done
the work needed to test using version scripts in place of -Bsymbolic for
fixing the symbol conflict problems. That ought to be a relatively
straightforward project for someone with access to a system to test on
(and I'm happy to advise on it).

The thing to do would be to try and copy the changes made to the *BSD
ports in commit e3d77ea6b instead of the change made in 4fa3741d1. The
contrib/postgres_fdw tests should show whether it worked or not.

-- 
Andrew (irc:RhodiumToad)




Re: [PATCH v1] Show whether tables are logged in \dt+

2019-04-22 Thread Fabien COELHO



Hello David,

I noticed that there wasn't a bulk way to see table logged-ness in psql, 
so I made it part of \dt+.


Applies, compiles, works for me.

ISTM that temporary-ness is not shown either. Maybe the persistence column
should be shown as is?


Also, I'd suggest displaying the column before the "Description" column,
to keep the length-varying one last.



What say?


Tests? Doc?

--
Fabien.




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2019-04-22 Thread Kyotaro HORIGUCHI
At Tue, 23 Apr 2019 11:34:38 +0900, Michael Paquier  wrote 
in <20190423023438.gh2...@paquier.xyz>
> On Mon, Apr 22, 2019 at 09:19:33PM +0900, Kyotaro HORIGUCHI wrote:
> > The attached exercises this sequence, needing some changes in
> > PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.
> 
> +# Check for symlink -- needed only on source dir
> +# (note: this will fall through quietly if file is already gone)
> +if (-l $srcpath)
> +{
> +croak "Cannot operate on symlink \"$srcpath\""
> +  if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
> +
> +# We have mapped tablespaces. Copy them individually
> +my $linkname = $1;
> +my $tmpdir = TestLib::tempdir;
> +my $dstrealdir = TestLib::real_dir($tmpdir);
> +my $srcrealdir = readlink($srcpath);
> +
> +opendir(my $dh, $srcrealdir);
> +while (readdir $dh)
> +{
> +next if (/^\.\.?$/);
> +my $spath = "$srcrealdir/$_";
> +my $dpath = "$dstrealdir/$_";
> +
> +copypath($spath, $dpath);
> +}
> +closedir $dh;
> +
> +symlink $dstrealdir, $destpath;
> +return 1;
> +}
> 
> The same stuff is proposed here:
> https://www.postgresql.org/message-id/cagrczquxd9yofifokxofj+fp3jdpoekczt+zh_prmnaadae...@mail.gmail.com
> 
> So there is a lot of demand for making the recursive copy more skilled
> at handling symlinks for tablespace tests, and I'd like to propose doing
> something along those lines for the tests on HEAD, presumably for
> v12 and not v13, as we are talking about a bug fix here?  I am not sure
> yet which one of the proposals is better than the other, though.

TBH I don't like that one (mine, cited above) so much. However, I'd
prefer to have it in v12 if this is a bug that is to be fixed in
v12. Otherwise we won't add a test for this later :p

Anyway, I'll take a look over there. Thanks.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-22 Thread Michael Paquier
On Thu, Apr 18, 2019 at 10:14:30AM +0900, Michael Paquier wrote:
> Doing a REINDEX TABLE directly on pg_class proves to work correctly,
> and CONCURRENTLY is not supported for catalog tables.
> 
> Bisecting my way through it, the first commit causing the breakage is
> that:
> commit: 01e386a325549b7755739f31308de4be8eea110d
> author: Tom Lane 
> date: Wed, 23 Dec 2015 20:09:01 -0500
> Avoid VACUUM FULL altogether in initdb.

This brings us down to a first, simple solution, which is to issue a
VACUUM FULL on pg_class at the end of make_template0() in initdb.c to
avoid any subsequent problems when trying to issue a REINDEX on anything
related to pg_class, though it won't fix any existing deployments:
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -2042,6 +2042,11 @@ make_template0(FILE *cmdfd)
 * Finally vacuum to clean up dead rows in pg_database
 */
"VACUUM pg_database;\n\n",
+
+   /*
+* And rebuild pg_class.
+*/
+   "VACUUM FULL pg_class;\n\n",
NULL
};
Now...

> The reason why this does not work is that CatalogIndexInsert() tries
> to do an index_insert directly on the index worked on.  And the reason
> why this works at table level is that we have tweaks in
> reindex_relation() to enforce the list of valid indexes in the
> relation cache with RelationSetIndexList().  It seems to me that the
> logic in reindex_index() is wrong from the start, and that all the
> index list handling done in reindex_relation() should just be in
> reindex_index() so as REINDEX INDEX gets the right call.

I have come to wonder if this dance with the relation cache is actually
necessary, because we could directly tell CatalogIndexInsert() to not
insert a tuple into an index which is being rebuilt, and the index
rebuild would cause an entry to be added to pg_class anyway thanks to
RelationSetNewRelfilenode().  This can obviously only happen for
pg_class indexes.

Any thoughts about both approaches?
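
For reference, a minimal sketch of the commands in question (using one of
pg_class's own indexes):

-- crashes on affected builds, per the subject of this thread
REINDEX INDEX pg_class_oid_index;

-- works, thanks to the RelationSetIndexList() dance in reindex_relation()
REINDEX TABLE pg_class;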
--
Michael
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e9399bef14..d4284ba08a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3502,7 +3502,6 @@ reindex_relation(Oid relid, int flags, int options)
 	Relation	rel;
 	Oid			toast_relid;
 	List	   *indexIds;
-	bool		is_pg_class;
 	bool		result;
 	int			i;
 
@@ -3538,32 +3537,8 @@ reindex_relation(Oid relid, int flags, int options)
 	 */
 	indexIds = RelationGetIndexList(rel);
 
-	/*
-	 * reindex_index will attempt to update the pg_class rows for the relation
-	 * and index.  If we are processing pg_class itself, we want to make sure
-	 * that the updates do not try to insert index entries into indexes we
-	 * have not processed yet.  (When we are trying to recover from corrupted
-	 * indexes, that could easily cause a crash.) We can accomplish this
-	 * because CatalogTupleInsert/CatalogTupleUpdate will use the relcache's
-	 * index list to know which indexes to update. We just force the index
-	 * list to be only the stuff we've processed.
-	 *
-	 * It is okay to not insert entries into the indexes we have not processed
-	 * yet because all of this is transaction-safe.  If we fail partway
-	 * through, the updated rows are dead and it doesn't matter whether they
-	 * have index entries.  Also, a new pg_class index will be created with a
-	 * correct entry for its own pg_class row because we do
-	 * RelationSetNewRelfilenode() before we do index_build().
-	 */
-	is_pg_class = (RelationGetRelid(rel) == RelationRelationId);
-
-	/* Ensure rd_indexattr is valid; see comments for RelationSetIndexList */
-	if (is_pg_class)
-		(void) RelationGetIndexAttrBitmap(rel, INDEX_ATTR_BITMAP_ALL);
-
 	PG_TRY();
 	{
-		List	   *doneIndexes;
 		ListCell   *indexId;
 		char		persistence;
 
@@ -3591,15 +3566,11 @@ reindex_relation(Oid relid, int flags, int options)
 			persistence = rel->rd_rel->relpersistence;
 
 		/* Reindex all the indexes. */
-		doneIndexes = NIL;
 		i = 1;
 		foreach(indexId, indexIds)
 		{
 			Oid			indexOid = lfirst_oid(indexId);
 
-			if (is_pg_class)
-RelationSetIndexList(rel, doneIndexes);
-
 			reindex_index(indexOid, !(flags & REINDEX_REL_CHECK_CONSTRAINTS),
 		  persistence, options);
 
@@ -3608,9 +3579,6 @@ reindex_relation(Oid relid, int flags, int options)
 			/* Index should no longer be in the pending list */
 			Assert(!ReindexIsProcessingIndex(indexOid));
 
-			if (is_pg_class)
-doneIndexes = lappend_oid(doneIndexes, indexOid);
-
 			/* Set index rebuild count */
 			pgstat_progress_update_param(PROGRESS_CLUSTER_INDEX_REBUILD_COUNT,
 		 i);
@@ -3626,9 +3594,6 @@ reindex_relation(Oid relid, int flags, int options)
 	PG_END_TRY();
 	ResetReindexPending();
 
-	if (is_pg_class)
-		RelationSetIndexList(rel, indexIds);
-
 	/*
 	 * Close rel, but continue to hold the lock.
 	 */
diff --git a/src/backend/catalog/indexing.c b/src/backend/catalog/indexing.c
index 0c994122d

Re: display of variables in EXPLAIN VERBOSE

2019-04-22 Thread Amit Langote
On 2019/04/23 0:58, Tom Lane wrote:
> BTW, now that I look at this, I think the reason why I didn't make
> tlist printouts pay attention to VERBOSE for this purpose is that
> you don't get them at all if not verbose:
> 
> regression=# explain select * from tt;
>   QUERY PLAN  
> --
>  Seq Scan on tt  (cost=0.00..32.60 rows=2260 width=8)
> (1 row)
> 
> So if we were to be rigidly consistent with this point of the docs,
> there would be no way to see a tlist without variable qualification,
> which doesn't really seem that nice.

Hmm, yes.  Variables in sort keys, quals, etc., which are shown even
without VERBOSE, are qualified only if VERBOSE is specified.  Variables in
the targetlists, which are shown only in the VERBOSE output, are displayed
without qualification, which looks a bit inconsistent.

explain (verbose, costs off) select * from foo where a > 0 order by 1;
  QUERY PLAN
──
 Sort
   Output: a
   Sort Key: foo.a
   ->  Seq Scan on public.foo
         Output: a
         Filter: (foo.a > 0)
(6 rows)

Maybe, targetlist variables should *always* be qualified given that they
are considered VERBOSE information to begin with?
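
In other words, under that rule the same plan would hypothetically come out
with the Output lines qualified as well, something like:

 Sort
   Output: foo.a
   Sort Key: foo.a
   ->  Seq Scan on public.foo
         Output: foo.a
         Filter: (foo.a > 0)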

Thanks,
Amit





Re: Regression test PANICs with master-standby setup on same machine

2019-04-22 Thread Kyotaro HORIGUCHI
At Tue, 23 Apr 2019 13:33:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190423.19.113770648.horiguchi.kyot...@lab.ntt.co.jp>
> At Tue, 23 Apr 2019 11:27:06 +0900, Michael Paquier  
> wrote in <20190423022706.gg2...@paquier.xyz>
> > On Mon, Apr 22, 2019 at 03:52:59PM +0530, Kuntal Ghosh wrote:
> > > I accept that configuring master-standby on the same machine for this
> > > test is not okay. But, can we avoid the PANIC somehow? Or, is this
> > > intentional and I should not include testtablespace in this case?
> > 
> > Well, it is a bit more than "not okay", as the primary and the
> > standby step on each other's toes because they are trying to use the
> > same tablespace path.  The PANIC is also expected, as that's what we
> > want with data_sync_retry = off, which is the default: the wanted
> > behavior is to PANIC immediately and enforce WAL recovery should an
> > fsync fail.  Obviously, not being able to have transparent tablespace
> > handling for multiple nodes on the same host is a problem, though
> > fixing it implies grammar changes for CREATE TABLESPACE or some sort of
> > node name handling which would make the experience trouble-free.  Still,
> > there is the argument that not all users would want both instances to
> > use the same tablespace path.  So the problem is not as simple as it
> > looks, and the cost of solving it is not worth the use cases either.
> 
> We could easily adopt a jail- or chroot-like feature for tablespace
> paths. Suppose a new GUC(!), say, tablespace_chroot, whose value
> can contain replacements like %a, %p, %h; we would set the
> variable as:
> 
> tablespace_chroot = '/somewhere/%p';
> 
> then the tablespace location would be prefixed by '/somewhere/5432' for
> the first server, and '/somewhere/5433' for the second.
> 
> I think this is rather a testing or debugging feature. This can
> be applied to all paths, so the variable might be "path_prefix" or

All paths outside the $PGDATA directory, that is: tablespace locations,
log_directory and stats_temp_directory?

> something more generic than tablespace_chroot.
> 
> Does it make sense?

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Regression test PANICs with master-standby setup on same machine

2019-04-22 Thread Kyotaro HORIGUCHI
At Tue, 23 Apr 2019 11:27:06 +0900, Michael Paquier  wrote 
in <20190423022706.gg2...@paquier.xyz>
> On Mon, Apr 22, 2019 at 03:52:59PM +0530, Kuntal Ghosh wrote:
> > I accept that configuring master-standby on the same machine for this
> > test is not okay. But, can we avoid the PANIC somehow? Or, is this
> > intentional and I should not include testtablespace in this case?
> 
> Well, it is a bit more than "not okay", as the primary and the
> standby step on each other's toes because they are trying to use the
> same tablespace path.  The PANIC is also expected, as that's what we
> want with data_sync_retry = off, which is the default: the wanted
> behavior is to PANIC immediately and enforce WAL recovery should an
> fsync fail.  Obviously, not being able to have transparent tablespace
> handling for multiple nodes on the same host is a problem, though
> fixing it implies grammar changes for CREATE TABLESPACE or some sort of
> node name handling which would make the experience trouble-free.  Still,
> there is the argument that not all users would want both instances to
> use the same tablespace path.  So the problem is not as simple as it
> looks, and the cost of solving it is not worth the use cases either.

We could easily adopt a jail- or chroot-like feature for tablespace
paths. Suppose a new GUC(!), say, tablespace_chroot, whose value
can contain replacements like %a, %p, %h; we would set the
variable as:

tablespace_chroot = '/somewhere/%p';

then the tablespace location would be prefixed by '/somewhere/5432' for
the first server, and '/somewhere/5433' for the second.

I think this is rather a testing or debugging feature. This can
be applied to all paths, so the variable might be "path_prefix" or
something more generic than tablespace_chroot.

Does it make sense?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: speeding up planning with partitions

2019-04-22 Thread Amit Langote
On 2019/04/23 7:08, Tom Lane wrote:
> Amit Langote  writes:
>> On 2019/04/02 2:34, Tom Lane wrote:
>>> I'm not at all clear on what we think the interaction between
>>> enable_partition_pruning and constraint_exclusion ought to be,
>>> so I'm not sure what the appropriate resolution is here.  Thoughts?
> 
>> Prior to 428b260f87 (that is, in PG 11), partition pruning for UPDATE and
>> DELETE queries is realized by applying constraint exclusion to the
>> partition constraint of the target partition.  The conclusion of the
>> discussion when adding the enable_partition_pruning GUC [1] was that
>> whether or not constraint exclusion is carried out (to facilitate
>> partition pruning) should be governed by the new GUC, not
>> constraint_exclusion, if only to avoid confusing users.
> 
> I got back to thinking about how this ought to work.

Thanks a lot for taking time to look at this.

> It appears to me
> that we've got half a dozen different behaviors that depend on one or both
> of these settings:
> 
> 1. Use of ordinary table constraints (CHECK, NOT NULL) in baserel pruning,
>   that is relation_excluded_by_constraints for baserels.
>   This is enabled by constraint_exclusion = on.
> 
> 2. Use of partition constraints in baserel pruning (applicable only
>   when a partition is accessed directly).
>   This is currently partly broken, and it's what your patch wants to
>   change.

Yes.  Any fix we come up with for this will need to be back-patched to 11,
because it's a regression introduced in 11 when the then new partition
pruning feature was committed (9fdb675fc).

> 3. Use of ordinary table constraints in appendrel pruning,
>   that is relation_excluded_by_constraints for appendrel members.
>   This is enabled by constraint_exclusion >= partition.
> 
> 4. Use of partition constraints in appendrel pruning.
>   This is enabled by the combination of enable_partition_pruning AND
>   constraint_exclusion >= partition.  However, it looks to me like this
>   is now nearly if not completely useless because of #5.
> 
> 5. Use of partition constraints in expand_partitioned_rtentry.
>   Enabled by enable_partition_pruning (see prune_append_rel_partitions).

Right, #5 obviates #4.

> 6. Use of partition constraints in run-time partition pruning.
>   This is also enabled by enable_partition_pruning, cf
>   create_append_plan, create_merge_append_plan.
> 
> Now in addition to what I mention above, there are assorted random
> differences in behavior depending on whether we are in an inherited
> UPDATE/DELETE or not.  I consider these differences to be so bogus
> that I'm not even going to include them in this taxonomy; they should
> not exist.  The UPDATE/DELETE target ought to act the same as a baserel.

The *partition* constraint of UPDATE/DELETE targets would never be refuted
by the query, because we process only those partition targets that remain
after applying partition pruning during the initial planning of the query
as if it were a SELECT.  I'm saying we should recognize such targets as
such when addressing #2.

Not sure if you'll like it, but maybe we could ignore even regular
inheritance child targets that are proven to be empty (is_dummy_rel()) for
a given query during the initial SELECT planning.  That way, we can avoid
running relation_excluded_by_constraints() a second time for *all*
child target relations.

> I think this is ridiculously overcomplicated even without said random
> differences.  I propose that we do the following:
> 
> * Get rid of point 4 by not considering partition constraints for
> appendrel members in relation_excluded_by_constraints.  It's just
> useless cycles in view of point 5, or nearly so.  (Possibly there
> are corner cases where we could prove contradictions between a
> relation's partition constraints and regular constraints ... but is
> it really worth spending planner cycles to look for that?)

I guess not.  If the partition constraint contradicts the regular
constraints, there wouldn't be any data in such partitions to begin with, no?

> * Make point 2 like point 1 by treating partition constraints for
> baserels like ordinary table constraints, ie, they are considered
> only when constraint_exclusion = on (independently of whether
> enable_partition_pruning is on).

Right, enable_partition_pruning should only apply to appendrel pruning.
If a partition is accessed directly, and hence is a baserel to the planner,
we would consider only constraint_exclusion and perform exclusion only if
the setting is on.

Another option is to treat partition constraints differently from regular
constraints and disregard the setting of constraint_exclusion, that is,
always perform constraint exclusion using the partition constraint.
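
To make the two cases concrete, a small sketch (table names are made up):

CREATE TABLE meas (ts date, val int) PARTITION BY RANGE (ts);
CREATE TABLE meas_2019 PARTITION OF meas
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

-- partitions reached through the parent are appendrel members (#5/#6);
-- pruning them is governed by enable_partition_pruning
SELECT * FROM meas WHERE ts = '2019-06-01';

-- a partition accessed directly is just a baserel (#2); under the proposal
-- its partition constraint would be used for exclusion only when
-- constraint_exclusion = on
SELECT * FROM meas_2019 WHERE ts = '2020-06-01';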

> * Treat an inherited UPDATE/DELETE target table as if it were an
> appendrel member for the purposes of relation_excluded_by_constraints,
> thus removing the behavioral differences between SELECT and UPDATE/DELETE.

As I mentioned above, planner encounters any given UPDATE/

Re: Symbol referencing errors

2019-04-22 Thread Li Japin


On 4/23/19 12:09 PM, Tom Lane wrote:
> AFAICT they're harmless, so my advice is just ignore them.
>
> If you're sufficiently annoyed by them to find the cause
> and try to fix it, go ahead, but I haven't heard anyone
> else worried about it.  It might be that SmartOS wants
> something like what we have to do on macOS and AIX,
> ie provide the core postgres executable in some sort of
> linker switch while linking shlibs that will be loaded
> by that executable.
Yes, those errors do not impact PostgreSQL itself, but when
I use the oracle_fdw extension, I cannot start up PostgreSQL,
and I find that dlopen throws an error which leads the postmaster
to exit, and there is no more information.


regards,

Japin Li



Re: Symbol referencing errors

2019-04-22 Thread Tom Lane
Li Japin  writes:
> When I compile PostgreSQL-11.2 on SmartOS, I find the following errors:
> ...
> ld: warning: symbol referencing errors

Yeah, our SmartOS buildfarm members show those warnings too, eg

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=damselfly&dt=2019-04-22%2010%3A00%3A15&stg=make-contrib

AFAICT they're harmless, so my advice is just ignore them.

If you're sufficiently annoyed by them to find the cause
and try to fix it, go ahead, but I haven't heard anyone
else worried about it.  It might be that SmartOS wants
something like what we have to do on macOS and AIX,
ie provide the core postgres executable in some sort of
linker switch while linking shlibs that will be loaded
by that executable.

regards, tom lane




Symbol referencing errors

2019-04-22 Thread Li Japin
Hi,

When I compile PostgreSQL-11.2 on SmartOS, I find the following errors:

Undefined            first referenced
  symbol                  in file
per_MultiFuncCall   adminpack.o
end_MultiFuncCall   adminpack.o
BuildTupleFromCStrings  adminpack.o
DecodeDateTime  adminpack.o
TupleDescGetAttInMetadata   adminpack.o
path_is_prefix_of_path  adminpack.o
canonicalize_path   adminpack.o
text_to_cstring adminpack.o
errmsg  adminpack.o
superuser   adminpack.o
errcode_for_file_access adminpack.o
palloc  adminpack.o
CurrentMemoryContext    adminpack.o
pstrdup adminpack.o
ReadDir adminpack.o
FreeFile    adminpack.o
errfinish   adminpack.o
init_MultiFuncCall  adminpack.o
errstart    adminpack.o
AllocateDir adminpack.o
GetUserId   adminpack.o
is_member_of_role   adminpack.o
psprintf    adminpack.o
DataDir adminpack.o
Log_filename    adminpack.o
Log_directory   adminpack.o
AllocateFile    adminpack.o
path_is_relative_and_below_cwd  adminpack.o
HeapTupleHeaderGetDatum adminpack.o
errcode adminpack.o
FreeDir adminpack.o
ParseDateTime   adminpack.o
path_contains_parent_reference  adminpack.o
pg_detoast_datum_packed adminpack.o
CreateTemplateTupleDesc adminpack.o
TupleDescInitEntry  adminpack.o
ld: warning: symbol referencing errors
make[1]: Leaving directory 
'/home/postgres/postgresql-11.2/contrib/adminpack'


My environment is:

# cat /etc/release
     SmartOS x86_64
   Copyright 2010 Sun Microsystems, Inc.  All Rights Reserved.
    Copyright 2015 Joyent, Inc.  All Rights Reserved.
     Use is subject to license terms.
    See joyent_20161108T160947Z for assembly date and time.
# $ pg_config
BINDIR = /home/postgres/pg11.2/bin
DOCDIR = /home/postgres/pg11.2/share/doc
HTMLDIR = /home/postgres/pg11.2/share/doc
INCLUDEDIR = /home/postgres/pg11.2/include
PKGINCLUDEDIR = /home/postgres/pg11.2/include
INCLUDEDIR-SERVER = /home/postgres/pg11.2/include/server
LIBDIR = /home/postgres/pg11.2/lib
PKGLIBDIR = /home/postgres/pg11.2/lib
LOCALEDIR = /home/postgres/pg11.2/share/locale
MANDIR = /home/postgres/pg11.2/share/man
SHAREDIR = /home/postgres/pg11.2/share
SYSCONFDIR = /home/postgres/pg11.2/etc
PGXS = /home/postgres/pg11.2/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/home/postgres/pg11.2' 'CFLAGS=-g -O0'
CC = gcc
CPPFLAGS =
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -g -O0
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,-R'/home/postgres/pg11.2/lib'
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lz -lreadline -lnsl -lsocket -lm
VERSION = PostgreSQL 11.2

Can anyone help me out? Thanks!

Best regards!

Japin Li


Re: clean up docs for v12

2019-04-22 Thread Michael Paquier
On Mon, Apr 22, 2019 at 12:19:55PM -0400, Alvaro Herrera wrote:
> On 2019-Apr-22, Andres Freund wrote:
>> I think it'd be better just to fix s/the all/that all/.
> 
> (and s/if's/if it's/)

FWIW, I noticed that part when gathering all the pieces for what
became 148266f; still, the full paragraph was sort of confusing, so I
just fixed the most obvious issues reported first.
--
Michael




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2019-04-22 Thread Michael Paquier
On Mon, Apr 22, 2019 at 09:19:33PM +0900, Kyotaro HORIGUCHI wrote:
> The attached exercises this sequence, needing some changes in
> PostgresNode.pm and RecursiveCopy.pm to allow tablespaces.

+# Check for symlink -- needed only on source dir
+# (note: this will fall through quietly if file is already gone)
+if (-l $srcpath)
+{
+croak "Cannot operate on symlink \"$srcpath\""
+  if ($srcpath !~ /\/(pg_tblspc\/[0-9]+)$/);
+
+# We have mapped tablespaces. Copy them individually
+my $linkname = $1;
+my $tmpdir = TestLib::tempdir;
+my $dstrealdir = TestLib::real_dir($tmpdir);
+my $srcrealdir = readlink($srcpath);
+
+opendir(my $dh, $srcrealdir);
+while (readdir $dh)
+{
+next if (/^\.\.?$/);
+my $spath = "$srcrealdir/$_";
+my $dpath = "$dstrealdir/$_";
+
+copypath($spath, $dpath);
+}
+closedir $dh;
+
+symlink $dstrealdir, $destpath;
+return 1;
+}

The same stuff is proposed here:
https://www.postgresql.org/message-id/cagrczquxd9yofifokxofj+fp3jdpoekczt+zh_prmnaadae...@mail.gmail.com

So there is a lot of demand for making the recursive copy more skilled
at handling symlinks for tablespace tests, and I'd like to propose doing
something along those lines for the tests on HEAD, presumably for
v12 and not v13, as we are talking about a bug fix here?  I am not sure
yet which one of the proposals is better than the other, though.
--
Michael




Re: Regression test PANICs with master-standby setup on same machine

2019-04-22 Thread Michael Paquier
On Mon, Apr 22, 2019 at 03:52:59PM +0530, Kuntal Ghosh wrote:
> I accept that configuring master-standby on the same machine for this
> test is not okay. But, can we avoid the PANIC somehow? Or, is this
> intentional and I should not include testtablespace in this case?

Well, it is a bit more than "not okay", as the primary and the
standby step on each other's toes because they are trying to use the
same tablespace path.  The PANIC is also expected, as that's what we
want with data_sync_retry = off, which is the default: the wanted
behavior is to PANIC immediately and enforce WAL recovery should an
fsync fail.  Obviously, not being able to have transparent tablespace
handling for multiple nodes on the same host is a problem, though
fixing it implies grammar changes for CREATE TABLESPACE or some sort of
node name handling which would make the experience trouble-free.  Still,
there is the argument that not all users would want both instances to
use the same tablespace path.  So the problem is not as simple as it
looks, and the cost of solving it is not worth the use cases either.
--
Michael




Re: finding changed blocks using WAL scanning

2019-04-22 Thread Robert Haas
On Mon, Apr 22, 2019 at 7:04 PM Tomas Vondra
 wrote:
> Some time ago there was a discussion about prefetching blocks during
> recovery on a standby, and that's a great example of a use case that
> benefit from this - look which blocks where modified in the next chunk
> of WAL, prefetch them. But that requires fairly detailed information
> about which blocks were modified in the next few megabytes of WAL.
>
> So just doing it once per checkpoint (or even anything above a single
> WAL segment) and removing all the detailed LSN location makes it useless
> for this use case.

For this particular use case, wouldn't you want to read the WAL itself
and use that to issue prefetch requests?  Because if you use the
.modblock files, the data file blocks will end up in memory but the
WAL blocks won't, and you'll still be waiting for I/O.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread Amit Langote
On 2019/04/23 7:51, Alvaro Herrera wrote:
> On 2019-Mar-06, Tom Lane wrote:
>> David Rowley  writes:
>>> As far as I can see, the biggest fundamental difference with doing
>>> things this way will be that the column order of partitions will be
>>> preserved, where before it would inherit the order of the partitioned
>>> table.  I'm a little unsure if doing this column reordering was an
>>> intended side-effect or not.
>>
>> Well, if the normal behavior results in changing the column order,
>> it'd be necessary to do things differently in --binary-upgrade mode
>> anyway, because there we *must* preserve column order.  I don't know
>> if what you're describing represents a separate bug for pg_upgrade runs,
>> but it might.  Is there any test case for the situation left behind by
>> the core regression tests?
> 
> Now that I re-read this complaint once again, I wonder if a mismatching
> column order in partitions isn't a thing we ought to preserve anyhow.
> Robert, Amit -- is it by design that pg_dump loses the original column
> order for partitions, when not in binary-upgrade mode?

I do remember being too wary initially about letting partitions devolve
into a state of needing tuple conversion during DML execution, which very
well may have been a reason to write the pg_dump support code the way it
is now.  pg_dump chooses to emit partitions with the CREATE TABLE
PARTITION OF syntax because, as seems to have been correctly interpreted on
this thread, it allows partitions to end up with the same TupleDesc as the
parent and hence not require tuple conversion during DML execution, unless
of course it's run in --binary-upgrade mode.

Needing tuple conversion is still an overhead, but maybe there aren't that
many cases where TupleDescs differ among tables in partition trees, so the
considerations behind emitting the PARTITION OF syntax may not be all that
relevant.  Also, we've made DML involving partitions pretty efficient
these days by reducing most other overheads, even though nothing has been
done to avoid tuple conversion in the cases where it is needed.

> To me, it sounds
> unintuitive to accept partitions that don't exactly match the order of
> the parent table; but it's been supported all along.

You might know it already, but even though column sets of two tables may
appear identical, their TupleDescs still may not match due to dropped
columns being different in the two tables.
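
A minimal sketch of how that can happen (table names are made up):

CREATE TABLE parent (a int, b int) PARTITION BY LIST (a);

CREATE TABLE part1 (a int, b int, c int);
ALTER TABLE part1 DROP COLUMN c;
ALTER TABLE parent ATTACH PARTITION part1 FOR VALUES IN (1);

-- \d now shows the same columns (a, b) for both tables, but part1's
-- TupleDesc still carries the dropped column, so tuples routed from
-- parent to part1 need conversion.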

> In the status quo,
> if users dump and restore such a database, the restored partition ends
> up with the column order of the parent instead of its own column order
> (by virtue of being created as CREATE TABLE PARTITION OF).  Isn't that
> wrong?  It'll cause an INSERT/COPY direct to the partition that worked
> prior to the restore to fail after the restore, if the column list isn't
> specified.

That's true, although there is a workaround as you mentioned -- specify
column names to match the input data.  pg_dump itself specifies them, so
the dumped output can be loaded unchanged.

Anyway, I don't see a problem with changing pg_dump to *always* emit
CREATE TABLE followed by ATTACH PARTITION, not just in --binary-upgrade
mode, if it lets us deal with the tablespace-related issues smoothly.

Thanks,
Amit





Re: change password_encryption default to scram-sha-256?

2019-04-22 Thread Jonathan S. Katz
On 4/8/19 6:10 PM, Jonathan S. Katz wrote:
> On 4/8/19 4:20 PM, Alvaro Herrera wrote:
>> On 2019-Apr-08, Jonathan S. Katz wrote:
>>
>>> On 4/8/19 4:10 PM, Alvaro Herrera wrote:
>>
 I wonder why we have two pages
 https://wiki.postgresql.org/wiki/Client_Libraries
 https://wiki.postgresql.org/wiki/List_of_drivers
>>>
>>> No clue, but it appears that first one is the newer of the two[1][2]
>>>
>>> I'd be happy to consolidate them and provide a forwarding reference from
>>> Client Libraries to List of Drivers, given I think we reference "List of
>>> Drivers" in other places.
>>
>> There are two links to List of drivers, and one of them is in Client
>> Libraries :-)
>> https://wiki.postgresql.org/wiki/Special:WhatLinksHere/Client_Libraries
>>
>> +1 for consolidation and setting up a redirect.
> 
> OK, so trying not to be too off topic, I did update the original page like so:
> 
> https://wiki.postgresql.org/wiki/List_of_drivers
> 
> That said, I am still in favor of the PG13 plan, and without objection I
> would like to reach out to the driver authors in the "no" category,
> reference this thread, note that this is at least being discussed, if not
> decided upon, and suggest that they consider adding support for SCRAM to
> allow adequate testing time as well as time for their drivers to make it
> into appropriate packaging systems.

OK so a small update, going through the list[1]:

- The golang drivers all now support SCRAM
- I've reached out to the remaining two driver projects on the list to
make them aware of this thread and the timeline discussion, and to offer
any help where needed in adding SCRAM support.

Jonathan

[1]https://wiki.postgresql.org/wiki/List_of_drivers





Re: memory leak checking

2019-04-22 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-22 20:29:17 -0400, Tom Lane wrote:
>> I would not call the timezone data a "leak", since it's still useful, and
>> accessible from static pointers, right up to exit.  A true leak for this
>> purpose is memory that's allocated but not usefully accessible, and I'd
>> say we discourage that; though small one-time leaks may not be worth the
>> trouble to get rid of.

> Right. I was only referring to it that way because the various leak
> checking tools do, should've been more careful in wording...

FWIW, I just did a simple test with valgrind's --leak-check=full,
and I can't find any clear evidence of *any* real leaks in a normal
backend run.  The things that valgrind thinks are leaks seem to be
mostly that it doesn't understand what we're doing.  For example,
(1) it seems to be fooled by pass-by-reference Datums, probably
because the underlying type declaration is an integer type not void*.
(2) it doesn't seem to understand how we manage the element arrays
for dynahash tables, because it claims they're all possibly leaked;
(3) it claims strings passed to putenv() have been leaked;
... etc etc.

Admittedly, this is with RHEL6's valgrind which isn't too modern,
but the net result doesn't really motivate me to spend more time here.

regards, tom lane




Re: memory leak checking

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 20:29:17 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-04-22 16:50:25 -0700, Mikhail Bautin wrote:
> >> What is the standard memory leak checking policy for the PostgreSQL
> >> codebase? I know there is some support for valgrind -- is the test suite
> >> being run continuously with valgrind on the build farm?
> 
> > Leaks are allowed if they are once-per-backend type things. There's no
> > point in e.g. freeing information for timezone metadata, given that
> > it'll be used for the whole server lifetime. And there's such things in
> > psql too, IIRC.
> 
> I would not call the timezone data a "leak", since it's still useful, and
> accessible from static pointers, right up to exit.  A true leak for this
> purpose is memory that's allocated but not usefully accessible, and I'd
> say we discourage that; though small one-time leaks may not be worth the
> trouble to get rid of.

Right. I was only referring to it that way because the various leak
checking tools do, should've been more careful in wording...

Greetings,

Andres Freund




Re: finding changed blocks using WAL scanning

2019-04-22 Thread Bruce Momjian
On Mon, Apr 22, 2019 at 08:52:11PM -0400, Bruce Momjian wrote:
> Well, the interesting question is whether the server will generate a
> single modblock file for all WAL in pg_wal only right before we are
> ready to expire some WAL, or whether modblock files will be generated
> offline, perhaps independent of the server, and perhaps by aggregating
> smaller modblock files.
> 
> To throw out an idea, what if we had an executable that could generate a
> modblock file by scanning a set of WAL files?  How far would that take
> us toward meeting incremental backup needs?  I can imagine db/relfilenode oid
> volatility could be a problem, but might be fixable.

Well, this actually brings up a bunch of questions:

*  How often do we create blockmod files?  Per segment, per checkpoint,
   at WAL deletion time (1GB?)

*  What is the blockmod file format?  dboid, relfilenode, blocknum?
   Use compression?  Sorted?

*  How do we create incremental backups?

*  What is the incremental backup file format?

*  How do we apply incremental backups to base backups?

And there are some secondary questions:

*  Can blockmod files be merged?

*  Can incremental backups be merged?

*  Can blockmod files be used for restore prefetching?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: bug in update tuple routing with foreign partitions

2019-04-22 Thread Amit Langote
Fujita-san,

On 2019/04/22 20:00, Etsuro Fujita wrote:
> (2019/04/19 14:39), Etsuro Fujita wrote:
>> (2019/04/19 13:00), Amit Langote wrote:
>>> I agree that we can move to fix the other issue right away as the fix you
>>> outlined above seems reasonable, but I wonder if it wouldn't be better to
>>> commit the two fixes separately? The two fixes, although small, are
>>> somewhat complicated and combining them in a single commit might be
>>> confusing. Also, a combined commit might make it harder for the release
>>> note author to list down the exact set of problems being fixed.
>>
>> OK, I'll separate the patch into two.
> 
> After I tried to separate the patch a bit, I changed my mind: I think we
> should commit the fixes for the two issues as a single patch, because the
> original issue (that mistakenly overriding the fmstate for the UPDATE
> operation with the fmstate for the INSERT operation caused a backend
> crash) is fixed by what I proposed above.  So I added the commit message
> to the previous single patch, as you suggested.

Ah, you're right.  The case that would return a wrong result, which is now
prevented per the latest patch, is also the case that would crash before.
So, it seems OK to keep this as one patch.  Sorry for the
noise.

I read your commit message and it seems to sufficiently explain the issues
being fixed.  Thanks for adding me as an author, but I think the latest
patch is mostly your work, so I'm happy to be listed as just a reviewer. :)

>>> Some mostly cosmetic comments on the code changes:
>>>
>>> * In the following comment:
>>>
>>> + /*
>>> + * If the foreign table is an UPDATE subplan resultrel that hasn't yet
>>> + * been updated, routing tuples to the table might yield incorrect
>>> + * results, because if routing tuples, routed tuples will be mistakenly
>>> + * read from the table and updated twice when updating the table --- it
>>> + * would be nice if we could handle this case; but for now, throw an
>>> error
>>> + * for safety.
>>> + */
>>>
>>> the part that start with "because if routing tuples..." reads a bit
>>> unclear to me. How about writing this as:
>>>
>>> /*
>>> * If the foreign table we are about to insert routed rows into is
>>> * also an UPDATE result rel and the UPDATE hasn't been performed yet,
>>> * proceeding with the INSERT will result in the later UPDATE
>>> * incorrectly modifying those routed rows, so prevent the INSERT ---
>>> * it would be nice if we could handle this case, but for now, throw
>>> * an error for safety.
>>> */
>>
>> I think that's better than mine; will use that wording.
> 
> Done.  I simplified your wording a little bit, though.

Thanks, looks fine.

> Other changes:
> * I updated the docs in postgres-fdw.sgml to mention the limitation.

Looks fine.

> * I did some cleanups for the regression tests.

Here too.

> Please find attached an updated version of the patch.

I don't have any more comments.  Thanks for working on this.

Thanks,
Amit





[PATCH v1] Show whether tables are logged in \dt+

2019-04-22 Thread David Fetter
Folks,

I noticed that there wasn't a bulk way to see table logged-ness in
psql, so I made it part of \dt+.

What say?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From 518efc003730c441663aae42c18d16f0908bd55d Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Mon, 22 Apr 2019 17:50:48 -0700
Subject: [PATCH v1] Show whether tables are logged in \dt+
To: hackers


\d would show this for individual tables, but there wasn't an
overarching view of all tables. Now, there is.

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index ee00c5da08..5ef567c123 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3631,7 +3631,7 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 	PQExpBufferData buf;
 	PGresult   *res;
 	printQueryOpt myopt = pset.popt;
-	static const bool translate_columns[] = {false, false, true, false, false, false, false};
+	static const bool translate_columns[] = {false, false, true, false, false, false, false, false};
 
 	/* If tabtypes is empty, we default to \dtvmsE (but see also command.c) */
 	if (!(showTables || showIndexes || showViews || showMatViews || showSeq || showForeign))
@@ -3695,6 +3695,11 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 		appendPQExpBuffer(&buf,
 		  ",\n  pg_catalog.obj_description(c.oid, 'pg_class') as \"%s\"",
 		  gettext_noop("Description"));
+
+		if (pset.sversion >= 91000)
+			appendPQExpBuffer(&buf,
+			  ",\n  c.relpersistence <> 'u' as \"%s\"",
+			  gettext_noop("Logged"));
 	}
 
 	appendPQExpBufferStr(&buf,





Re: finding changed blocks using WAL scanning

2019-04-22 Thread Bruce Momjian
On Tue, Apr 23, 2019 at 02:13:29AM +0200, Tomas Vondra wrote:
> Well, I understand that concern - all I'm saying is that it makes this
> useless for some use cases (which may or may not be important enough).
> 
> However, it seems to me those files are guaranteed to be much smaller
> than the WAL segments, so I don't see how size alone could be an issue
> as long as we do the merging and deduplication when recycling the
> segments. At that point the standby can't request the WAL from the
> primary anyway, so it won't need the raw .modblock files either.
> 
> And we probably only care about the size of the data we need to keep for
> a long time. And that we can deduplicate/reorder any way we want.

Well, the interesting question is whether the server will generate a
single modblock file for all WAL in pg_wal only right before we are
ready to expire some WAL, or whether modblock files will be generated
offline, perhaps independent of the server, and perhaps by aggregating
smaller modblock files.

To throw out an idea, what if we had an executable that could generate a
modblock file by scanning a set of WAL files?  How far would that take
us toward meeting incremental backup needs?  I can imagine db/relfilenode oid
volatility could be a problem, but might be fixable.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: memory leak checking

2019-04-22 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-22 16:50:25 -0700, Mikhail Bautin wrote:
>> What is the standard memory leak checking policy for the PostgreSQL
>> codebase? I know there is some support for valgrind -- is the test suite
>> being run continuously with valgrind on the build farm?

> Leaks are allowed if they are once-per-backend type things. There's no
> point in e.g. freeing information for timezone metadata, given that
> it'll be used for the whole server lifetime. And there's such things in
> psql too, IIRC.

I would not call the timezone data a "leak", since it's still useful, and
accessible from static pointers, right up to exit.  A true leak for this
purpose is memory that's allocated but not usefully accessible, and I'd
say we discourage that; though small one-time leaks may not be worth the
trouble to get rid of.

regards, tom lane




Re: finding changed blocks using WAL scanning

2019-04-22 Thread Tomas Vondra

On Mon, Apr 22, 2019 at 07:44:45PM -0400, Bruce Momjian wrote:

On Tue, Apr 23, 2019 at 01:21:27AM +0200, Tomas Vondra wrote:

On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
> On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost  wrote:
> > > Oh.  Well, I already explained my algorithm for doing that upthread,
> > > which I believe would be quite cheap.
> > >
> > > 1. When you generate the .modblock files, stick all the block
> > > references into a buffer.  qsort().  Dedup.  Write out in sorted
> > > order.
> >
> > Having all of the block references in a sorted order does seem like it
> > would help, but would also make those potentially quite a bit larger
> > than necessary (I had some thoughts about making them smaller elsewhere
> > in this discussion).  That might be worth it though.  I suppose it might
> > also be possible to line up the bitmaps suggested elsewhere to do
> > essentially a BitmapOr of them to identify the blocks changed (while
> > effectively de-duping at the same time).
>
> I don't see why this would make them bigger than necessary.  If you
> sort by relfilenode/fork/blocknumber and dedup, then references to
> nearby blocks will be adjacent in the file.  You can then decide what
> format will represent that most efficiently on output.  Whether or not
> a bitmap is a better idea than a list of block numbers or something else
> depends on what percentage of blocks are modified and how clustered
> they are.
>

Not sure I understand correctly - are you suggesting to deduplicate and sort
the data before writing them into the .modblock files? Because then the
sorting would make this information mostly useless for the recovery
prefetching use case I mentioned elsewhere. For that to work we need
information about both the LSN and the block, in LSN order.

So if we want to allow that use case to leverage this infrastructure, we
need to write the .modfiles kinda "raw" and do this processing in some
later step.

Now, maybe the incremental backup use case is so much more important the
right thing to do is ignore this other use case, and I'm OK with that -
as long as it's a conscious choice.


I think the concern is that the more granular the modblock files are
(with less de-duping), the larger they will be.



Well, I understand that concern - all I'm saying is that it makes this
useless for some use cases (that may or may not be important enough).

However, it seems to me those files are guaranteed to be much smaller
than the WAL segments, so I don't see how size alone could be an issue
as long as we do the merging and deduplication when recycling the
segments. At that point the standby can't request the WAL from the
primary anyway, so it won't need the raw .modblock files either.

And we probably only care about the size of the data we need to keep for
a long time. And that we can deduplicate/reorder any way we want.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: memory leak checking

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 16:50:25 -0700, Mikhail Bautin wrote:
> What is the standard memory leak checking policy for the PostgreSQL
> codebase? I know there is some support for valgrind -- is the test suite
> being run continuously with valgrind on the build farm?

There's continuous use of valgrind on the buildfarm - but those animals
have leak checking disabled. Postgres uses memory contexts for nearly all
server allocations, which allows memory to be freed in bulk. There's also
plenty of memory that's intentionally allocated until the end of the
backend's lifetime (e.g. the various caches over the system catalogs).  Due
to that, checks like valgrind's leak detection are not particularly meaningful.
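
To illustrate, a typical backend pattern looks roughly like the sketch below
(purely illustrative; the function and context names are made up).  Nothing
allocated in the context needs an individual pfree(), because deleting the
context releases it all at once:

static void
do_some_work(void)
{
    MemoryContext tmpcxt;
    MemoryContext oldcxt;
    char       *buf;

    tmpcxt = AllocSetContextCreate(CurrentMemoryContext,
                                   "scratch context",
                                   ALLOCSET_DEFAULT_SIZES);
    oldcxt = MemoryContextSwitchTo(tmpcxt);

    buf = palloc(8192);         /* never explicitly pfree'd */
    /* ... build whatever transient state is needed in buf ... */

    MemoryContextSwitchTo(oldcxt);
    MemoryContextDelete(tmpcxt);    /* frees buf and everything else in tmpcxt */
}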


> Is there any plan to support clang's AddressSanitizer?

Not for the leak portion. I use asan against the backend, after
disabling the leak check, and that's useful. Should probably set up a
proper buildfarm animal for that.


> I've seen a thread that memory leaks are allowed in initdb, because it is a
> short-lived process. Obviously they are not allowed in the database server.
> Are memory leaks allowed in the psql tool?

Leaks are allowed if they are once-per-backend type things. There's no
point in e.g. freeing information for timezone metadata, given that
it'll be used for the whole server lifetime. And there's such things in
psql too, IIRC.

Greetings,

Andres Freund




memory leak checking

2019-04-22 Thread Mikhail Bautin
Hello PostgreSQL Hackers,

What is the standard memory leak checking policy for the PostgreSQL
codebase? I know there is some support for valgrind -- is the test suite
being run continuously with valgrind on the build farm?

Is there any plan to support clang's AddressSanitizer?

I've seen a thread that memory leaks are allowed in initdb, because it is a
short-lived process. Obviously they are not allowed in the database server.
Are memory leaks allowed in the psql tool?

Regards,
Mikhail


Re: finding changed blocks using WAL scanning

2019-04-22 Thread Bruce Momjian
On Tue, Apr 23, 2019 at 01:21:27AM +0200, Tomas Vondra wrote:
> On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
> > On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost  wrote:
> > > > Oh.  Well, I already explained my algorithm for doing that upthread,
> > > > which I believe would be quite cheap.
> > > >
> > > > 1. When you generate the .modblock files, stick all the block
> > > > references into a buffer.  qsort().  Dedup.  Write out in sorted
> > > > order.
> > > 
> > > Having all of the block references in a sorted order does seem like it
> > > would help, but would also make those potentially quite a bit larger
> > > than necessary (I had some thoughts about making them smaller elsewhere
> > > in this discussion).  That might be worth it though.  I suppose it might
> > > also be possible to line up the bitmaps suggested elsewhere to do
> > > essentially a BitmapOr of them to identify the blocks changed (while
> > > effectively de-duping at the same time).
> > 
> > I don't see why this would make them bigger than necessary.  If you
> > sort by relfilenode/fork/blocknumber and dedup, then references to
> > nearby blocks will be adjacent in the file.  You can then decide what
> > format will represent that most efficiently on output.  Whether or not
> > a bitmap is better idea than a list of block numbers or something else
> > depends on what percentage of blocks are modified and how clustered
> > they are.
> > 
> 
> Not sure I understand correctly - do you suggest deduplicating and sorting
> the data before writing them into the .modblock files? Because the
> sorting would make this information mostly useless for the recovery
> prefetching use case I mentioned elsewhere. For that to work we need
> information about both the LSN and the block, in LSN order.
> 
> So if we want to allow that use case to leverage this infrastructure, we
> need to write the .modblock files kinda "raw" and do this processing in some
> later step.
> 
> Now, maybe the incremental backup use case is so much more important that
> the right thing to do is to ignore this other use case, and I'm OK with that -
> as long as it's a conscious choice.

I think the concern is that the more granular the modblock files are
(with less de-duping), the larger they will be.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: finding changed blocks using WAL scanning

2019-04-22 Thread Tomas Vondra

On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:

On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost  wrote:

> Oh.  Well, I already explained my algorithm for doing that upthread,
> which I believe would be quite cheap.
>
> 1. When you generate the .modblock files, stick all the block
> references into a buffer.  qsort().  Dedup.  Write out in sorted
> order.

Having all of the block references in a sorted order does seem like it
would help, but would also make those potentially quite a bit larger
than necessary (I had some thoughts about making them smaller elsewhere
in this discussion).  That might be worth it though.  I suppose it might
also be possible to line up the bitmaps suggested elsewhere to do
essentially a BitmapOr of them to identify the blocks changed (while
effectively de-duping at the same time).


I don't see why this would make them bigger than necessary.  If you
sort by relfilenode/fork/blocknumber and dedup, then references to
nearby blocks will be adjacent in the file.  You can then decide what
format will represent that most efficiently on output.  Whether or not
a bitmap is better idea than a list of block numbers or something else
depends on what percentage of blocks are modified and how clustered
they are.



Not sure I understand correctly - do you suggest deduplicating and sorting
the data before writing them into the .modblock files? Because the
sorting would make this information mostly useless for the recovery
prefetching use case I mentioned elsewhere. For that to work we need
information about both the LSN and the block, in LSN order.

So if we want to allow that use case to leverage this infrastructure, we
need to write the .modblock files kinda "raw" and do this processing in some
later step.

Now, maybe the incremental backup use case is so much more important that
the right thing to do is to ignore this other use case, and I'm OK with that -
as long as it's a conscious choice.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread Tom Lane
Alvaro Herrera  writes:
> Now that I re-read this complaint once again, I wonder if a mismatching
> column order in partitions isn't a thing we ought to preserve anyhow.
> Robert, Amit -- is it by design that pg_dump loses the original column
> order for partitions, when not in binary-upgrade mode?

I haven't looked at the partitioning code, but I am quite sure that that's
always happened for old-style inheritance children, and I imagine pg_dump
is just duplicating that old behavior.

Wasn't there already a patch on the table to change this, though,
by removing the code path that uses inheritance rather than the
binary-upgrade-like solution?

regards, tom lane




Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-22 Thread Tom Lane
Peter Geoghegan  writes:
> On Mon, Apr 22, 2019 at 3:12 PM Melanie Plageman
>  wrote:
>> PrepareTempTablespaces is called by most callers of BufFileCreateTemp, so I was
>> wondering if there is a reason not to call it inside BufFileCreateTemp.

> The best answer I can think of is that a BufFileCreateTemp() caller
> might not want to do catalog access. Perhaps the contortions within
> assign_temp_tablespaces() are something that callers ought to opt in
> to explicitly.

It's kind of hard to see a reason to call it outside a transaction,
and even if we did, there are provisions for it not to go boom.

> That doesn't seem like a particularly good or complete answer, though.
> Perhaps it should simply be called within BufFileCreateTemp(). The
> BufFile/fd.c layering is confusing in a number of ways IMV.

I don't actually see why BufFileCreateTemp should do it; if
we're to add a call, seems like OpenTemporaryFile is the place,
as that's what is really concerned with the temp tablespace(s).

I'm in favor of doing this, I think, as it sure looks to me like
gistInitBuildBuffers() is calling BufFileCreateTemp without any
closely preceding PrepareTempTablespaces.  So we already have an
instance of Melanie's bug in core.  It'd be difficult to notice
because of the silent-fallback-to-default-tablespace behavior.
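
For clarity, the pattern every such caller is currently expected to follow
by hand looks like this (illustrative fragment, not a patch):

    BufFile    *file;

    /* each caller must currently remember to do this itself */
    PrepareTempTablespaces();
    file = BufFileCreateTemp(false);

and the idea above is simply to fold that PrepareTempTablespaces() call into
the fd.c layer so that callers such as gistInitBuildBuffers() can't forget it.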

regards, tom lane




Re: finding changed blocks using WAL scanning

2019-04-22 Thread Tomas Vondra

On Mon, Apr 22, 2019 at 01:15:49PM -0400, Bruce Momjian wrote:

On Mon, Apr 22, 2019 at 01:11:22PM -0400, Robert Haas wrote:

On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian  wrote:
> I assumed the modblock files would be stored in the WAL archive so some
> external tools could generate incremental backups using just the WAL
> files.  I assumed they would also be sent to standby servers so
> incremental backups could be done on standby servers too.

Yeah, that's another possible approach.  I am not sure what is best.


I am thinking you need to allow any of these, and putting the WAL files
in pg_wal and having them streamed and archived gives that flexibility.



I agree - this would be quite useful for the prefetching use case I've
already mentioned in my previous message.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: finding changed blocks using WAL scanning

2019-04-22 Thread Tomas Vondra

On Thu, Apr 18, 2019 at 04:25:24PM -0400, Robert Haas wrote:

On Thu, Apr 18, 2019 at 3:51 PM Bruce Momjian  wrote:

How would you choose the STARTLSN/ENDLSN?  If you could do it per
checkpoint, rather than per-WAL, I think that would be great.


I thought of that too.  It seems appealing, because you probably only
really care whether a particular block was modified between one
checkpoint and the next, not exactly when during that interval it was
modified. 


That's probably true for incremental backups, but there are other use
cases that could leverage this information.

Some time ago there was a discussion about prefetching blocks during
recovery on a standby, and that's a great example of a use case that would
benefit from this - look at which blocks were modified in the next chunk
of WAL and prefetch them. But that requires fairly detailed information
about which blocks were modified in the next few megabytes of WAL.

So just doing it once per checkpoint (or even anything above a single
WAL segment) and removing all the detailed LSN locations makes it useless
for this use case.
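
As a rough illustration of what the recovery side could do with LSN-ordered
entries (the iterator type and next_modblock_ref() are made-up names;
smgropen() and smgrprefetch() are existing APIs):

static void
prefetch_upcoming_blocks(ModBlockIterator *it, XLogRecPtr upto)
{
    XLogRecPtr  lsn;
    RelFileNode rnode;
    ForkNumber  forknum;
    BlockNumber blkno;

    /* walk the LSN-ordered block references ahead of the current redo position */
    while (next_modblock_ref(it, &lsn, &rnode, &forknum, &blkno))
    {
        if (lsn >= upto)
            break;
        /* ask the kernel to start reading the block in before redo needs it */
        smgrprefetch(smgropen(rnode, InvalidBackendId), forknum, blkno);
    }
}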

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread Alvaro Herrera
On 2019-Mar-06, Tom Lane wrote:

> David Rowley  writes:
> > As far as I can see, the biggest fundamental difference with doing
> > things this way will be that the column order of partitions will be
> > preserved, where before it would inherit the order of the partitioned
> > table.  I'm a little unsure if doing this column reordering was an
> > intended side-effect or not.
> 
> Well, if the normal behavior results in changing the column order,
> it'd be necessary to do things differently in --binary-upgrade mode
> anyway, because there we *must* preserve column order.  I don't know
> if what you're describing represents a separate bug for pg_upgrade runs,
> but it might.  Is there any test case for the situation left behind by
> the core regression tests?

Now that I re-read this complaint once again, I wonder if a mismatching
column order in partitions isn't a thing we ought to preserve anyhow.
Robert, Amit -- is it by design that pg_dump loses the original column
order for partitions, when not in binary-upgrade mode?  To me, it sounds
unintuitive to accept partitions that don't exactly match the order of
the parent table; but it's been supported all along.  In the status quo,
if users dump and restore such a database, the restored partition ends
up with the column order of the parent instead of its own column order
(by virtue of being created as CREATE TABLE PARTITION OF).  Isn't that
wrong?  It'll cause an INSERT/COPY directly into the partition that worked
prior to the restore to fail after the restore, if the column list isn't
specified.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-22 Thread Peter Geoghegan
On Mon, Apr 22, 2019 at 3:12 PM Melanie Plageman
 wrote:
> PrepareTempTablespaces is called by most callers of BufFileCreateTemp, so I was
> wondering if there is a reason not to call it inside BufFileCreateTemp.

The best answer I can think of is that a BufFileCreateTemp() caller
might not want to do catalog access. Perhaps the contortions within
assign_temp_tablespaces() are something that callers ought to opt in
to explicitly.

That doesn't seem like a particularly good or complete answer, though.
Perhaps it should simply be called within BufFileCreateTemp(). The
BufFile/fd.c layering is confusing in a number of ways IMV.

-- 
Peter Geoghegan




Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread Robert Haas
On Mon, Apr 22, 2019 at 4:43 PM Alvaro Herrera  wrote:
> Well, frequently when people discuss ideas on this list, others discuss
> and provide further ideas to try to help find a working solution, rather
> than throw every roadblock they can think of (though roadblocks are
> indeed thrown now and then).  If I've taken a long time to find a
> working solution, maybe it's because I have no shoulders of giants to
> stand on, and I'm a pretty short guy, so I need to build me a ladder.

What exactly do you mean by throwing up roadblocks?  I don't have a
basket full of great ideas for how to solve this problem that I'm
failing to suggest out of some sort of perverse desire to see you
fail.  I'm not quite as convinced as Tom and Andres that this whole
idea is fatally flawed and can't ever be made to work correctly, but I
think it's quite possible that they are right, both because their
objections sound to me like they are on target and because they are
pretty smart people.  But that's also because I haven't spent a lot of
time on this issue, which I think is pretty fair, because it seems
like it would be unfair to complain that I am not spending enough time
helping fix code that I advised against committing in the first place.
How much time should I spend giving you advice if my previous advice
was ignored?

But FWIW, it seems to me that a good place to start solving this
problem would be to think hard about what Andres said here:

http://postgr.es/m/20190306161744.22jdkg37fyi2z...@alap3.anarazel.de

Specifically, this complaint: "I still think the feature as is doesn't
seem to have very well defined behaviour."

If we know what the feature is supposed to do, then it should be
possible to look at each relevant piece of code and decides whether it
implements the specification correctly or not.  But if we don't have a
specification -- that is, we don't know precisely what the feature is
supposed to do -- then we'll just end up whacking the behavior around
trying to get some combination that makes sense, and the whole effort
is probably doomed.

I think that the large quote block from David Rowley in the middle of
the above-linked email gets at the definitional problem pretty
clearly: the documentation seems to be intending to say -- although it
is not 100% clear -- that if TABLESPACE is specified it has effect on
all future children, and if not specified then those children get the
tablespace they would have gotten anyway.  But that behavior is
impossible to implement correctly unless there is a way to distinguish
between a partitioned table for which TABLESPACE was not specified and
where it was specified to be the same as the default tablespace for
the database.  And we know, per previous conversation, that the
catalog representation admits of no way to make that distinction.
Using reltablespace = dattablespace is clearly the wrong answer,
because that breaks stuff.  Tom earlier suggested, I believe, that
some fixed OID could be used, e.g. reltablespace = 1 means "tablespace
not specified" and reltablespace = 0 means dattablespace.  That should
be safe enough, since I don't think OID 1 can ever be the OID of a
real tablespace.  I think this is probably the only way forward if
this definition captures the desired behavior.
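
To make that concrete, code that needs the effective tablespace for a new
partition would end up with a three-way test roughly like this (sketch only;
the OID-1 sentinel is, to repeat, hypothetical and not in the tree today):

static Oid
partition_default_tablespace(Form_pg_class parentform)
{
    if (parentform->reltablespace == 1)
        return InvalidOid;          /* no TABLESPACE clause: child decides as usual */
    if (parentform->reltablespace == InvalidOid)
        return MyDatabaseTableSpace;    /* parent explicitly said dattablespace */
    return parentform->reltablespace;   /* parent explicitly named a tablespace */
}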

The other option is to decide that we want some other behavior.  In
that case, the way forward depends on what we want the behavior to be.
Your proposal upthread is to disallow the case where the user provides
an explicit TABLESPACE setting whose value matches the default
tablespace for the database.  But that definition seems to have an
obvious problem: just because that's not true when the partitioned
table is defined doesn't mean it won't become true later, because the
default tablespace for the database can be changed, or the database
can be copied and the copy be assigned a different default tablespace.
Even if there were no issue, that definition doesn't sound very clean,
because it means that reltablespace = 0 for a regular relation means
dattablespace and for a partitioned relation it means none.

Now that's not the only way we could go either, I suppose.  There must
be a variety of other possible behaviors.  But the one Tom and Andres
are proposing -- go back to the way it worked in v10 -- is definitely
not crazy.  You are right that a lot of people didn't like that, but
the definition was absolutely clear, because it was the exact same
definition we use for normal tables, with the straightforward
extension that for relations with no storage it meant nothing.

Going back to the proposal of making OID = 0 mean TABLESPACE
dattablespace, OID = 1 meaning no TABLESPACE clause specified for this
partitioned table, and OID = whatever meaning that particular
non-default tablespace, I do see a couple of problems with that idea
too:

- I don't have any idea how to salvage things in v11, where the
catalog representation is irredeemably ambiguous.  We'd probably have
to be satisfied with fairly goofy

Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-22 Thread Melanie Plageman
Hi,
PrepareTempTablespaces is called by most callers of BufFileCreateTemp, so I was
wondering if there is a reason not to call it inside BufFileCreateTemp.

As a developer using BufFileCreateTemp to write code that will create spill
files, I found it easy to forget the extra step of checking the temp_tablespaces
GUC to ensure the spill files are created there if it is set.

Thanks,
Melanie Plageman


Re: speeding up planning with partitions

2019-04-22 Thread Tom Lane
Amit Langote  writes:
> On 2019/04/02 2:34, Tom Lane wrote:
>> I'm not at all clear on what we think the interaction between
>> enable_partition_pruning and constraint_exclusion ought to be,
>> so I'm not sure what the appropriate resolution is here.  Thoughts?

> Prior to 428b260f87 (that is, in PG 11), partition pruning for UPDATE and
> DELETE queries is realized by applying constraint exclusion to the
> partition constraint of the target partition.  The conclusion of the
> discussion when adding the enable_partition_pruning GUC [1] was that
> whether or not constraint exclusion is carried out (to facilitate
> partition pruning) should be governed by the new GUC, not
> constraint_exclusion, if only to avoid confusing users.

I got back to thinking about how this ought to work.  It appears to me
that we've got half a dozen different behaviors that depend on one or both
of these settings:

1. Use of ordinary table constraints (CHECK, NOT NULL) in baserel pruning,
  that is relation_excluded_by_constraints for baserels.
  This is enabled by constraint_exclusion = on.

2. Use of partition constraints in baserel pruning (applicable only
  when a partition is accessed directly).
  This is currently partly broken, and it's what your patch wants to
  change.

3. Use of ordinary table constraints in appendrel pruning,
  that is relation_excluded_by_constraints for appendrel members.
  This is enabled by constraint_exclusion >= partition.

4. Use of partition constraints in appendrel pruning.
  This is enabled by the combination of enable_partition_pruning AND
  constraint_exclusion >= partition.  However, it looks to me like this
  is now nearly if not completely useless because of #5.

5. Use of partition constraints in expand_partitioned_rtentry.
  Enabled by enable_partition_pruning (see prune_append_rel_partitions).

6. Use of partition constraints in run-time partition pruning.
  This is also enabled by enable_partition_pruning, cf
  create_append_plan, create_merge_append_plan.

Now in addition to what I mention above, there are assorted random
differences in behavior depending on whether we are in an inherited
UPDATE/DELETE or not.  I consider these differences to be so bogus
that I'm not even going to include them in this taxonomy; they should
not exist.  The UPDATE/DELETE target ought to act the same as a baserel.

I think this is ridiculously overcomplicated even without said random
differences.  I propose that we do the following:

* Get rid of point 4 by not considering partition constraints for
appendrel members in relation_excluded_by_constraints.  It's just
useless cycles in view of point 5, or nearly so.  (Possibly there
are corner cases where we could prove contradictions between a
relation's partition constraints and regular constraints ... but is
it really worth spending planner cycles to look for that?)

* Make point 2 like point 1 by treating partition constraints for
baserels like ordinary table constraints, ie, they are considered
only when constraint_exclusion = on (independently of whether
enable_partition_pruning is on).

* Treat an inherited UPDATE/DELETE target table as if it were an
appendrel member for the purposes of relation_excluded_by_constraints,
thus removing the behavioral differences between SELECT and UPDATE/DELETE.

With this, constraint_exclusion would act pretty much as it traditionally
has, and in most cases would not have any special impact on partitions
compared to old-style inheritance.  The behaviors that
enable_partition_pruning would control are expand_partitioned_rtentry
pruning and run-time pruning, neither of which have any applicability to
old-style inheritance.

Thoughts?

regards, tom lane




Re: TM format can mix encodings in to_char()

2019-04-22 Thread Juan José Santamaría Flecha
Actually, I tried to show my findings with the tr_TR regression test, but you
can reproduce the same issue with other locales and non-ASCII characters, as
Tom has pointed out.

For example:

de_DE ISO-8859-1: March
es_ES ISO-8859-1: Wednesday
fr_FR ISO-8859-1: February

Regards,

Juan José Santamaría Flecha


Re: pgsql: Allow insert and update tuple routing and COPY for foreign table

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 22:56:20 +0200, Laurenz Albe wrote:
> On Mon, 2019-04-22 at 13:27 -0700, Andres Freund wrote:
> > On 2019-04-22 21:37:25 +0200, Laurenz Albe wrote:
> > > Commit 3d956d956a introduced support for foreign tables as partitions
> > > and COPY FROM on foreign tables.
> > > 
> > > If a foreign data wrapper supports data modifications, but either has
> > > not adapted to this change yet or doesn't want to support it
> > > for other reasons, it probably got broken by the above commit,
> > > because COPY will just call ExecForeignInsert anyway, which might not
> > > work because neither PlanForeignModify nor BeginForeignModify have
> > > been called.
> > > 
> > > To avoid breaking third-party foreign data wrappers in that way, allow
> > > COPY FROM and tuple routing for foreign tables only if the foreign data
> > > wrapper implements BeginForeignInsert.
> > 
> > Isn't this worse though? Before this it's an API change between major
> > versions. With this it's an API change in a *minor* version. Sure, it's
> > one that doesn't crash, but it's still a pretty substantial function
> > regression, no?
> 
> Hm, that's a good point.
> You could say that this patch is too late, because a FDW might already be
> relying on COPY FROM to work without BeginForeignInsert in v11.

I think that's the case.


> How about just applying the patch from v12 on?
> Then it is a behavior change in a major release, which is acceptable.
> It requires the imaginary FDW above to add an empty BeginForeignInsert
> callback function, but will unbreak FDWs that slept through the change
> that added COPY support.

I fail to see the advantage. It'll still require FDWs to be fixed to
work correctly for v11, but additionally adds another set of API
differences that need to be fixed by another set of FDWs.  I think this
ship simply has sailed.

Greetings,

Andres Freund




Re: pgsql: Allow insert and update tuple routing and COPY for foreign table

2019-04-22 Thread Laurenz Albe
On Mon, 2019-04-22 at 13:27 -0700, Andres Freund wrote:
> On 2019-04-22 21:37:25 +0200, Laurenz Albe wrote:
> > Commit 3d956d956a introduced support for foreign tables as partitions
> > and COPY FROM on foreign tables.
> > 
> > If a foreign data wrapper supports data modifications, but either has
> > not adapted to this change yet or doesn't want to support it
> > for other reasons, it probably got broken by the above commit,
> > because COPY will just call ExecForeignInsert anyway, which might not
> > work because neither PlanForeignModify nor BeginForeignModify have
> > been called.
> > 
> > To avoid breaking third-party foreign data wrappers in that way, allow
> > COPY FROM and tuple routing for foreign tables only if the foreign data
> > wrapper implements BeginForeignInsert.
> 
> Isn't this worse though? Before this it's an API change between major
> versions. With this it's an API change in a *minor* version. Sure, it's
> one that doesn't crash, but it's still a pretty substantial functional
> regression, no?

Hm, that's a good point.
You could say that this patch is too late, because a FDW might already be
relying on COPY FROM to work without BeginForeignInsert in v11.

How about just applying the patch from v12 on?
Then it is a behavior change in a major release, which is acceptable.
It requires the imaginary FDW above to add an empty BeginForeignInsert
callback function, but will unbreak FDWs that slept through the change
that added COPY support.

Yours,
Laurenz Albe





Re: pgsql: Allow insert and update tuple routing and COPY for foreign table

2019-04-22 Thread Laurenz Albe
On Mon, 2019-04-22 at 16:24 -0400, Robert Haas wrote:
> On Mon, Apr 22, 2019 at 3:37 PM Laurenz Albe  wrote:
> > Sure, it is not hard to modify a FDW to continue working with v11.
> > 
> > My point is that this should not be necessary.
> 
> I'm not sure whether this proposal is a good idea or a bad idea, but I
> think that it's inevitable that FDWs are going to need some updating
> for new releases as the API evolves.  That has happened before and
> will continue to happen.

Absolutely.
I am just unhappy that this change caused unnecessary breakage.

If you developed a read-only FDW for 9.2, it didn't break with the
write support introduced in 9.3, because that used different API
functions.  That's how it should be IMHO.

If you developed a FDW for 9.1, it got broken in 9.2, because the
API had to change to allow returning multiple paths.
That was unfortunate but necessary, so it is ok.

Nothing in this patch required an incompatible change.

Yours,
Laurenz Albe





Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread Alvaro Herrera
On 2019-Apr-22, Robert Haas wrote:

> On Mon, Apr 22, 2019 at 3:08 PM Alvaro Herrera  
> wrote:
> > On 2019-Apr-22, Andres Freund wrote:
> > > Why is the obvious answer is to not just remove the whole tablespace
> > > inheritance behaviour?
> >
> > Because it was requested by many, and there were plenty of people
> > surprised that things didn't work that way.
> 
> On the other hand, that behavior worked correctly on its own terms,
> and this behavior seems to be broken, and you've been through a whole
> series of possible designs trying to figure out how to fix it, and
> it's still not clear that you've got a working solution.

Well, frequently when people discuss ideas on this list, others discuss
and provide further ideas to try to help find a working solution, rather
than throw every roadblock they can think of (though roadblocks are
indeed thrown now and then).  If I've taken a long time to find a
working solution, maybe it's because I have no shoulders of giants to
stand on, and I'm a pretty short guy, so I need to build me a ladder.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pgsql: Allow insert and update tuple routing and COPY for foreign table

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 21:37:25 +0200, Laurenz Albe wrote:
> Subject: [PATCH] Foreign table COPY FROM and tuple routing requires
>  BeginForeignInsert
> 
> Commit 3d956d956a introduced support for foreign tables as partitions
> and COPY FROM on foreign tables.
> 
> If a foreign data wrapper supports data modifications, but either has
> not adapted to this change yet or doesn't want to support it
> for other reasons, it probably got broken by the above commit,
> because COPY will just call ExecForeignInsert anyway, which might not
> work because neither PlanForeignModify nor BeginForeignModify have
> been called.
> 
> To avoid breaking third-party foreign data wrappers in that way, allow
> COPY FROM and tuple routing for foreign tables only if the foreign data
> wrapper implements BeginForeignInsert.

Isn't this worse though? Before this it's an API change between major
versions. With this it's an API change in a *minor* version. Sure, it's
one that doesn't crash, but it's still a pretty substantial functional
regression, no?

Greetings,

Andres Freund




Re: pgsql: Allow insert and update tuple routing and COPY for foreign table

2019-04-22 Thread Robert Haas
On Mon, Apr 22, 2019 at 3:37 PM Laurenz Albe  wrote:
> Sure, it is not hard to modify a FDW to continue working with v11.
>
> My point is that this should not be necessary.

I'm not sure whether this proposal is a good idea or a bad idea, but I
think that it's inevitable that FDWs are going to need some updating
for new releases as the API evolves.  That has happened before and
will continue to happen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread Robert Haas
On Mon, Apr 22, 2019 at 3:08 PM Alvaro Herrera  wrote:
> On 2019-Apr-22, Andres Freund wrote:
> > Why is the obvious answer is to not just remove the whole tablespace
> > inheritance behaviour?
>
> Because it was requested by many, and there were plenty of people
> surprised that things didn't work that way.

On the other hand, that behavior worked correctly on its own terms,
and this behavior seems to be broken, and you've been through a whole
series of possible designs trying to figure out how to fix it, and
it's still not clear that you've got a working solution.  I don't know
whether that shows that it is impossible to make this idea work
sensibly, but at the very least it proves that the whole area needed a
lot more thought than it got before this code was committed (a
complaint that I also made at the time, if you recall).  "Surprising"
is not great, but it is clearly superior to "broken."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: long transactions on hot standby feedback replica / proof of concept

2019-04-22 Thread Alexander Korotkov
On Thu, Nov 29, 2018 at 3:44 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Fri, Aug 24, 2018 at 5:53 PM Alexander Korotkov 
> >  wrote:
> >
> > Given I've no feedback on this idea yet, I'll try to implement a PoC
> > patch for that.  It doesn't look to be difficult.  And we'll see how
> > does it work.
>
> Unfortunately, current version of the patch doesn't pass the tests and fails on
> initdb. But maybe you already have this PoC you were talking about, that will
> also incorporate the feedback from this thread? For now I'll move it to the
> next CF.

Finally, I managed to write a PoC.

If you look at the list of problems I've enumerated in [1], this PoC is
aimed at 1 and 3.

> 1) Data corruption on file truncation error (explained in [1]).
> 2) Expensive scanning of the whole shared buffers before file truncation.
> 3) Cancel of read-only queries on standby even if hot_standby_feedback
> is on, caused by replication of AccessExclusiveLock.

2 is a pretty independent problem and could be addressed later.

Basically, this patch does the following:
1. Introduces a new flag, BM_DIRTY_BARRIER, which prevents a dirty buffer
from being written out.
2. Implements two-phase truncation of a relation's buffers.  The first phase
runs prior to file truncation and marks dirty buffers past the truncation
point as BM_DIRTY_BARRIER.  The second phase runs after file truncation and
actually wipes out the buffers past the truncation point.
3. If an exception happens during file truncation, the BM_DIRTY_BARRIER flag
is released from the buffers.  Thus, no data corruption should happen here.
If the file truncation was partially complete, the file might be extended
again by the write of a dirty buffer.  I'm not sure how likely that is, but
such an extension could lead to the errors again.  Still, it shouldn't cause
data corruption.
4. Having too many buffers marked as BM_DIRTY_BARRIER would paralyze the
buffer manager.  This is why we keep no more than NBuffers/2 marked as
BM_DIRTY_BARRIER.  If the limit is exceeded, dirty buffers are simply
written out in the first phase.
5. lazy_truncate_heap() now takes ExclusiveLock instead of
AccessExclusiveLock.  This part is not really complete.  At least, we
need to ensure that reads past the truncation point, caused by read-only
queries concurrent with the truncation, don't lead to real errors.

Any thoughts?

1. 
https://www.postgresql.org/message-id/CAPpHfdtD3U2DpGZQJNe21s9s1s-Va7NRNcP1isvdCuJxzYypcg%40mail.gmail.com

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-POC-fix-relation-truncation-1.patch
Description: Binary data


translatability tweaks

2019-04-22 Thread Alvaro Herrera
Here's a small number of translatability tweaks to the error messages in
the backend.  I found these while updating the Spanish translation over
the past weekend, a task I had neglected for two or three years, so they
might involve some older messages.  However, I won't backpatch this --
only apply to pg12 tomorrow or so.

(I haven't yet gone over the full backend message catalog yet, so I
might propose more fixes later on.)

-- 
Álvaro Herrera                http://www.twitter.com/alvherre
>From 091d4043aca3fa4936f1020b635a78e8b6d9010d Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 22 Apr 2019 15:45:06 -0400
Subject: [PATCH] Unify error messages

... for translatability purposes.
---
 src/backend/storage/ipc/latch.c   | 12 +---
 src/backend/storage/ipc/signalfuncs.c |  4 +++-
 src/backend/utils/adt/formatting.c|  9 ++---
 src/backend/utils/adt/genfile.c   |  4 +++-
 src/backend/utils/adt/json.c  |  4 +++-
 src/backend/utils/adt/jsonb.c |  4 +++-
 6 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 59fa917ae04..e0712f906a1 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -856,7 +856,9 @@ WaitEventAdjustEpoll(WaitEventSet *set, WaitEvent *event, int action)
 	if (rc < 0)
 		ereport(ERROR,
 (errcode_for_socket_access(),
- errmsg("epoll_ctl() failed: %m")));
+ /* translator: %s is a syscall name, such as "poll()" */
+ errmsg("%s failed: %m",
+		"epoll_ctl()")));
 }
 #endif
 
@@ -1087,7 +1089,9 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			waiting = false;
 			ereport(ERROR,
 	(errcode_for_socket_access(),
-	 errmsg("epoll_wait() failed: %m")));
+	 /* translator: %s is a syscall name, such as "poll()" */
+	 errmsg("%s failed: %m",
+			"epoll_wait()")));
 		}
 		return 0;
 	}
@@ -1211,7 +1215,9 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
 			waiting = false;
 			ereport(ERROR,
 	(errcode_for_socket_access(),
-	 errmsg("poll() failed: %m")));
+	 /* translator: %s is a syscall name, such as "poll()" */
+	 errmsg("%s failed: %m",
+			"poll()")));
 		}
 		return 0;
 	}
diff --git a/src/backend/storage/ipc/signalfuncs.c b/src/backend/storage/ipc/signalfuncs.c
index 4769b1b51eb..4bfbd57464c 100644
--- a/src/backend/storage/ipc/signalfuncs.c
+++ b/src/backend/storage/ipc/signalfuncs.c
@@ -181,7 +181,9 @@ pg_rotate_logfile(PG_FUNCTION_ARGS)
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  (errmsg("must be superuser to rotate log files with adminpack 1.0"),
-  errhint("Consider using pg_logfile_rotate(), which is part of core, instead.";
+  /* translator: %s is a SQL function name */
+  errhint("Consider using %s, which is part of core, instead.",
+		  "pg_logfile_rotate()";
 
 	if (!Logging_collector)
 	{
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index df1db7bc9f1..69a691f18e7 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1566,7 +1566,8 @@ str_tolower(const char *buff, size_t nbytes, Oid collid)
  */
 ereport(ERROR,
 		(errcode(ERRCODE_INDETERMINATE_COLLATION),
-		 errmsg("could not determine which collation to use for lower() function"),
+		 errmsg("could not determine which collation to use for %s function",
+"lower()"),
 		 errhint("Use the COLLATE clause to set the collation explicitly.")));
 			}
 			mylocale = pg_newlocale_from_collation(collid);
@@ -1688,7 +1689,8 @@ str_toupper(const char *buff, size_t nbytes, Oid collid)
  */
 ereport(ERROR,
 		(errcode(ERRCODE_INDETERMINATE_COLLATION),
-		 errmsg("could not determine which collation to use for upper() function"),
+		 errmsg("could not determine which collation to use for %s function",
+"upper()"),
 		 errhint("Use the COLLATE clause to set the collation explicitly.")));
 			}
 			mylocale = pg_newlocale_from_collation(collid);
@@ -1811,7 +1813,8 @@ str_initcap(const char *buff, size_t nbytes, Oid collid)
  */
 ereport(ERROR,
 		(errcode(ERRCODE_INDETERMINATE_COLLATION),
-		 errmsg("could not determine which collation to use for initcap() function"),
+		 errmsg("could not determine which collation to use for %s function",
+"initcap()"),
 		 errhint("Use the COLLATE clause to set the collation explicitly.")));
 			}
 			mylocale = pg_newlocale_from_collation(collid);
diff --git a/src/backend/utils/adt/genfile.c b/src/backend/utils/adt/genfile.c
index d6976609968..a3c6adaf640 100644
--- a/src/backend/utils/adt/genfile.c
+++ b/src/backend/utils/adt/genfile.c
@@ -219,7 +219,9 @@ pg_read_file(PG_FUNCTION_ARGS)
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  (errmsg("must be superuser to read files with adminpack 1.0"),
-  errhint("

Re: block-level incremental backup

2019-04-22 Thread Robert Haas
On Mon, Apr 22, 2019 at 2:26 PM Stephen Frost  wrote:
> There was basically zero discussion about what things would look like at
> a protocol level (I went back and skimmed over the thread before sending
> my last email to specifically see if I was going to get this response
> back..).  I get the idea behind the diff file, the contents of which I
> wasn't getting into above.

Well, I wrote:

"There should be a way to tell pg_basebackup to request from the
server only those blocks where LSN >= threshold_value."

I guess I assumed that people interested in the details would take
that to mean "and therefore the protocol would grow an option for this
type of request in whatever way is the most straightforward possible
extension of the current functionality," which is indeed how you
eventually interpreted it when you said we could "extend BASE_BACKUP
is by adding LSN as an optional parameter."
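
A server-side filter along those lines could be as simple as the following
sketch (the helper and variables are made up; PageGetLSN() is the existing
macro):

    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        Page        page = (Page) (filebuf + (Size) blkno * BLCKSZ);

        /* only ship blocks whose last modification is at or after the threshold */
        if (PageGetLSN(page) >= threshold_lsn)
            send_block(blkno, page);
    }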

I could have been more explicit, but sometimes people tell me that my
emails are too long.

> external tools to leverage that.  It sounds like what you're suggesting
> now is that you're happy to implement the backend code, expose it in a
> way that works just for pg_basebackup, and that if someone else wants to
> add things to the protocol to make it easier for external tools to
> leverage, great.

Yep, that's more or less it, although I am potentially willing to do
some modest amount of that other work along the way.  I just don't
want to prioritize it higher than getting the actual thing I want to
build built, which I think is a pretty fair position for me to take.

> All I can say is that that's basically how we ended up
> in the situation we're in today where pg_basebackup doesn't support
> parallel backup but a bunch of external tools do and they don't go
> through the backend to get there, even though they'd probably prefer to.

I certainly agree that core should try to do things in a way that is
useful to external tools when that can be done without undue effort,
but only if it can actually be done without undue effort.  Let's see
whether that's the case here:

- Anastasia wants a command added that dumps out whatever the server
knows about what files have changed, which I already agreed was a
reasonable extension of my initial proposal.

- You said that for this to be useful to pgbackrest, it'd have to use
a whole different mechanism that includes commands to request
individual files and blocks within those files, which would be a
significant rewrite of pg_basebackup that you agreed is more closely
related to parallel backup than to the project under discussion on
this thread.  And that even then pgbackrest probably wouldn't use it
because it also does server-side compression and encryption which are
not included in this proposal.

It seems to me that the first one falls into the category of a reasonable
additional effort and the second one falls into the category of lots
of extra and unrelated work that wouldn't even get used.

> Thanks for sharing your thoughts on that, certainly having the backend
> able to be more intelligent about streaming files to avoid latency is
> good and possibly the best approach.  Another alternative to reducing
> the latency would be to have a way for the client to request a set of
> files, but I don't know that it'd be better.

I don't know either.  This is an area that needs more thought, I
think, although as discussed, it's more related to parallel backup
than $SUBJECT.

> I'm not really sure why the above is extremely inconvenient for
> third-party tools, beyond just that they've already been written to work
> with an assumption that the server-side of things isn't as intelligent
> as PG is.

Well, one thing you might want to do is have a tool that connects to
the server, enters backup mode, requests information on what blocks
have changed, copies those blocks via direct filesystem access, and
then exits backup mode.  Such a tool would really benefit from a
START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
command language, because it would just skip ever issuing the
SEND_FILE_CONTENTS command in favor of doing that part of the work via
other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
command is useless to such a tool.

Contrariwise, a tool that has its own magic - perhaps based on
WAL-scanning or something like ptrack - to know which files currently
exist and which blocks are modified could use SEND_FILE_CONTENTS but
not SEND_FILE_LIST.  And a filesystem-snapshot based technique might
use START_BACKUP and STOP_BACKUP but nothing else.

In short, providing granular commands like this lets the client be
really intelligent even if the server isn't, and lets the client have
fine-grained control of the process.  This is very good if you're an
out-of-core tool maintainer and your tool is trying to be smarter than
- or even just differently-designed than - core.

But if what you really want is just a maximally-efficient parallel
backup, you don't nee

Re: pgsql: Allow insert and update tuple routing and COPY for foreign table

2019-04-22 Thread Laurenz Albe
On Mon, 2019-04-22 at 21:45 +0900, Etsuro Fujita wrote:

Thanks for looking into this!

> (2019/04/20 20:53), Laurenz Albe wrote:
> > On Fri, 2018-04-06 at 23:24 +, Robert Haas wrote:
> > > Allow insert and update tuple routing and COPY for foreign tables.
> > > 
> > > Also enable this for postgres_fdw.
> > > 
> > > Etsuro Fujita, based on an earlier patch by Amit Langote. The larger
> > > patch series of which this is a part has been reviewed by Amit
> > > Langote, David Fetter, Maksim Milyutin, Álvaro Herrera, Stephen Frost,
> > > and me.  Minor documentation changes to the final version by me.
> > > 
> > > Discussion: 
> > > http://postgr.es/m/29906a26-da12-8c86-4fb9-d8f88442f...@lab.ntt.co.jp
> > 
> > This commit makes life hard for foreign data wrappers that support
> > data modifications.
> > 
> > If a FDW implements ExecForeignInsert, this commit automatically assumes
> > that it also supports COPY FROM.  It will call ExecForeignInsert without
> > calling PlanForeignModify and BeginForeignModify, and a FDW that does not
> > expect that will probably fail.
> 
> This is not 100% correct; the FDW documentation says:
> 
>  
>   Tuples inserted into a partitioned table by 
> INSERT or
>   COPY FROM are routed to partitions.  If an FDW
>   supports routable foreign-table partitions, it should also provide the
>   following callback functions.  These functions are also called when
>   COPY FROM is executed on a foreign table.
>  

I don't see the difference between the documentation and what I wrote above.

Before v11, a FDW could expect that ExecForeignInsert is only called if
BeginForeignModify was called earlier.
That has silently changed with v11.

> > maybe there are FDWs that support INSERT but don't want to support COPY
> > for some reason.
> 
> I agree on that point.
> 
> > I propose that PostgreSQL only allows COPY FROM on a foreign table if the 
> > FDW
> > implements BeginForeignInsert.  The attached patch implements that.
> 
> I don't think that is a good idea, because there might be some FDWs that 
> want to support COPY FROM on foreign tables without providing 
> BeginForeignInsert.  (As for INSERT into foreign tables, we actually 
> allow FDWs to support it without providing PlanForeignModify, 
> BeginForeignModify, or EndForeignModify.)
> 
> It's permissible to throw an error in BeginForeignInsert, so what I was 
> thinking for FDWs that don't want to support COPY FROM and 
> INSERT/UPDATE/COPY FROM tuple routing was to provide BeginForeignInsert 
> implementing something like this:
> 
> static void
> fooBeginForeignInsert(ModifyTableState *mtstate,
>                       ResultRelInfo *resultRelInfo)
> {
>     Relation    rel = resultRelInfo->ri_RelationDesc;
>
>     if (mtstate->ps.plan == NULL)
>         ereport(ERROR,
>                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>                  errmsg("cannot copy to foreign table \"%s\"",
>                         RelationGetRelationName(rel))));
>     else
>         ereport(ERROR,
>                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>                  errmsg("cannot route tuples into foreign table \"%s\"",
>                         RelationGetRelationName(rel))));
> }

Sure, it is not hard to modify a FDW to continue working with v11.

My point is that this should not be necessary.

If a FDW worked well with v10, it should continue to work with v11
unless there is a necessity for a compatibility-breaking change.

On the other hand, if a FDW wants to support COPY in v11 and has no
need for BeginForeignInsert to support that, it is a simple exercise
for it to provide an empty BeginForeignInsert just to signal that it
wants to support COPY.
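
Something as trivial as this would be enough to opt in (illustrative sketch,
using the usual foo_fdw naming):

static void
fooBeginForeignInsert(ModifyTableState *mtstate,
                      ResultRelInfo *resultRelInfo)
{
    /* nothing to do - merely providing the callback signals COPY support */
}

    /* ... and in the FdwRoutine handler: */
    routine->BeginForeignInsert = fooBeginForeignInsert;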

I realized that my previous patch forgot to check for tuple routing,
updated patch attached.

Yours,
Laurenz Albe
From 545ac9d567ea6ca610ce08355451923cc787e13d Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Mon, 22 Apr 2019 21:35:06 +0200
Subject: [PATCH] Foreign table COPY FROM and tuple routing requires
 BeginForeignInsert

Commit 3d956d956a introduced support for foreign tables as partitions
and COPY FROM on foreign tables.

If a foreign data wrapper supports data modifications, but either has
not adapted to this change yet or doesn't want to support it
for other reasons, it probably got broken by the above commit,
because COPY will just call ExecForeignInsert anyway, which might not
work because neither PlanForeignModify nor BeginForeignModify have
been called.

To avoid breaking third-party foreign data wrappers in that way, allow
COPY FROM and tuple routing for foreign tables only if the foreign data
wrapper implements BeginForeignInsert.
---
 doc/src/sgml/fdwhandler.sgml |  2 ++
 src/backend/commands/copy.c  |  5 +
 src/backend/executor/execPartition.c | 17 -
 3 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 2c07a86637..84f

Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Apr-22, Andres Freund wrote:
>> Why is the obvious answer is to not just remove the whole tablespace
>> inheritance behaviour?

> Because it was requested by many, and there were plenty of people
> surprised that things didn't work that way.

There are lots of things in SQL that people find surprising.
In this particular case, "we can't do it because it conflicts with
ancient decisions about how PG tablespaces work" seems like a
defensible answer, even without getting into the question of
whether "partitions inherit their tablespace from the parent"
is really any less surprising than "partitions work exactly like
normal tables as far as tablespace selection goes".

regards, tom lane




Re: TM format can mix encodings in to_char()

2019-04-22 Thread Tom Lane
Peter Geoghegan  writes:
> On Sun, Apr 21, 2019 at 6:26 AM Andrew Dunstan
>  wrote:
>> How does one do that? Just set a Turkish locale?

> tr_TR is, in a sense, special among locales:
> http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html
> The Turkish dotless i has apparently been implicated in all kinds of
> bugs in quite a variety of contexts.

Yeah, we've had our share of those :-(.  But the dotless i is not the
problem here --- it happens to not trigger an encoding conversion
issue, it seems.  Amusingly, the existing test case for lc_time = tr_TR
in collate.linux.utf8.sql is specifically coded to check what happens
with dotted/dotless i, and yet it manages to not trip over this problem.
(I suspect the reason is that what comes out of strftime is "Nis" which
is ASCII, and the non-ASCII characters only arise from subsequent case
conversion within PG.)

regards, tom lane




Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread Alvaro Herrera
On 2019-Apr-22, Tom Lane wrote:

> Yeah, that's where I'm at as well.  Alvaro's proposal could be made
> to work perhaps, but I think it would still end up with some odd
> corner-case behaviors.  One example is that "TABLESPACE X" would
> be allowed if the database's default tablespace is Y, but if you
> try to dump and restore into a database whose default is X, it'd be
> rejected (?).

Hmm, I don't think so, because dump uses default_tablespace on a plain
table instead of TABLESPACE clauses, and the table is attached
afterwards.

> The results after ALTER DATABASE ... SET TABLESPACE X
> are unclear too.

Currently we disallow SET TABLESPACE X if you have any table in that
tablespace, and we do that by searching for files.  A partitioned table
would not have a file that would cause it to fail, so this is something
to study.


(BTW I think these tablespace behaviors are not tested very much.  The
tests we have are intra-database operations only, and there's only a
single non-default tablespace.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: block-level incremental backup

2019-04-22 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-04-22 14:26:40 -0400, Stephen Frost wrote:
> > I'm disappointed that the concerns about the trouble that end users are
> > likely to have with this didn't garner more discussion.
> 
> My impression is that endusers are having a lot more trouble due to
> important backup/restore features not being in core/pg_basebackup, than
> due to external tools having a harder time to implement certain
> features.

I had been referring specifically to the concern I raised about
incremental block-level backups being added to pg_basebackup and how
that'll make using pg_basebackup more complicated and therefore more
difficult for end-users to get right, particularly if the end user is
having to handle management of the association between the full backup
and the incremental backups.  I wasn't referring to anything regarding
external tools.

> Focusing on external tools being able to provide all those
> features, because core hasn't yet, is imo entirely the wrong thing to
> concentrate upon.  And it's not like things largely haven't been
> implemented in pg_basebackup for fundamental architectural reasons.
> It's because we've built like 5 different external tools with randomly
> differing featureset and licenses.

There's a few challenges when it comes to adding backup features to
core.  One of the reasons is that core naturally moves slower when it
comes to development than external projects do, as was discussed
earlier on this thread.  Another is that, when it comes to backup,
specifically, people want to back up their *existing* systems, which
means that they need a backup tool that's going to work with whatever
version of PG they've currently got deployed and that's often a few
years old already.  Certainly when I've thought about features that we'd
like to see and considered if there's something that could be
implemented in core vs. implemented outside of core, the answer often
ends up being "well, if we do it ourselves then we can make it work for
PG 9.2 and above, and have it working for existing users, but if we work
it in as part of core, it won't be available until next year and only
for version 12 and above, and users can only use it once they've
upgraded.."

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread Alvaro Herrera
On 2019-Apr-22, Andres Freund wrote:

> Why is the obvious answer is to not just remove the whole tablespace
> inheritance behaviour?

Because it was requested by many, and there were plenty of people
surprised that things didn't work that way.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Allow any[] as input arguments for sql/plpgsql functions to mimic format()

2019-04-22 Thread Tom Lane
Michał "phoe" Herda  writes:
> My reasoning in this case is - if we allow the any[] type to only be
> passed to other functions that accept any[], and disallow any kind of
> other operations on this array (such as retrieving its elements or
> modifying it), I do not yet see any places where it might introduce a
> performance regression.

Performance regressions are not the question here --- or at least, there
are a lot of other questions to get past first.

* plpgsql doesn't have any mechanism for restricting the use of a
parameter in the way you suggest.  It's not clear if it'd be practical
to add one, given the arms-length way in which plpgsql does expression
evaluation, and it seems likely that any such thing would be messy
and bug-prone.

* There's not actually any such type as any[].  There's anyarray,
which is not what you're wishing for here because it'd constrain
all the actual arguments to be the same type (or at least coercible
to the same array element type).  This is related to the next point...

* format() isn't declared as taking any[].  It's really

regression=# \df format
  List of functions
   Schema   |  Name  | Result data type | Argument data types  | Type 
------------+--------+------------------+----------------------+------
 pg_catalog | format | text             | text                 | func
 pg_catalog | format | text             | text, VARIADIC "any" | func
(2 rows)

"VARIADIC any" is a very special hack, because unlike other VARIADIC
cases, it doesn't result in collapsing the actual arguments into an
array.  (Again, it can't because they might not have a common type.)
The called function has to have special logic for examining its
arguments to find out how many there are and what their types are.
format() can do that because it's written in C, but a plpgsql function,
not so much.

* We could imagine allowing a plpgsql function to be declared
"VARIADIC any", and treating it as having N polymorphic arguments of
independent types, but then what?  plpgsql has no notation for
accessing such arguments (they wouldn't have names, to start with),
nor for finding out how many there are, and it certainly has no
notation for passing the whole group of them on to some other function.


I think the closest you'll be able to get here is to declare the
plpgsql function as taking "variadic text[]" and then passing the
text array to format() with a VARIADIC marker.  That should work
mostly okay, although calls might need explicit casts to text
in some cases.
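
For concreteness, a minimal sketch of that workaround (the function and
argument names here are made up, not taken from the attached SQL file):

-- wrapper declared with "variadic text[]", forwarding the array to format()
-- with a VARIADIC marker
CREATE OR REPLACE FUNCTION raise_formatted(fmt text, VARIADIC args text[])
RETURNS void
LANGUAGE plpgsql AS $$
BEGIN
    RAISE EXCEPTION '%', format(fmt, VARIADIC args);
END;
$$;

-- text arguments work directly; other types need an explicit cast to text
SELECT raise_formatted('The person %s has only %s bananas.', 'Fred', 8::text);
-- ERROR:  The person Fred has only 8 bananas.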

FWIW, it'd likely be easier to get around these problems in plperl
or pltcl than in plpgsql, as those are both much less concerned
with the exact data types of their arguments than plpgsql, and
more accustomed to dealing with functions with variable argument
lists.  I don't know Python well enough to say whether the same
is true of plpython.

regards, tom lane




Re: TM format can mix encodings in to_char()

2019-04-22 Thread Peter Geoghegan
On Sun, Apr 21, 2019 at 6:26 AM Andrew Dunstan
 wrote:
> How does one do that? Just set a Turkish locale?

tr_TR is, in a sense, special among locales:

http://blog.thetaphi.de/2012/07/default-locales-default-charsets-and.html

The Turkish dotless i has apparently been implicated in all kinds of
bugs in quite a variety of contexts.
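
For anyone who wants to poke at it, a sketch along the lines of Andrew's
question (this assumes the tr_TR.utf8 locale is actually installed):

SET lc_time = 'tr_TR.utf8';
-- the TM ("translated mode") prefix makes day/month names follow lc_time
SELECT to_char(date '2019-04-22', 'TMDay, DD TMMonth YYYY');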

-- 
Peter Geoghegan




Re: block-level incremental backup

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 14:26:40 -0400, Stephen Frost wrote:
> I'm disappointed that the concerns about the trouble that end users are
> likely to have with this didn't garner more discussion.

My impression is that end users are having a lot more trouble due to
important backup/restore features not being in core/pg_basebackup, than
due to external tools having a harder time to implement certain
features. Focusing on external tools being able to provide all those
features, because core hasn't yet, is imo entirely the wrong thing to
concentrate upon.  And it's not like things largely haven't been
implemented in pg_basebackup for fundamental architectural reasons.
It's because we've built like 5 different external tools with randomly
differing featureset and licenses.

Greetings,

Andres Freund




Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-22 14:16:28 -0400, Alvaro Herrera wrote:
>> I think we can get out of this whole class of problems by forbidding the
>> TABLESPACE clause for partitioned rels from mentioning the database
>> tablespace -- that is, users either mention some *other* tablespace, or
>> partitions follow default_tablespace like everybody else.  AFAICS with
>> that restriction this whole problem does not arise, and the patch may
>> become simpler.  I'll give it a spin.

> Why is the obvious answer not to just remove the whole tablespace
> inheritance behaviour? It's obviously ambiguous and hard to get right.
> I still don't see any usecase that even comes close to making the
> inheritance useful enough to justify the amount of work (code, tests,
> bugfixes) and docs that are required.

Yeah, that's where I'm at as well.  Alvaro's proposal could be made
to work perhaps, but I think it would still end up with some odd
corner-case behaviors.  One example is that "TABLESPACE X" would
be allowed if the database's default tablespace is Y, but if you
try to dump and restore into a database whose default is X, it'd be
rejected (?).  The results after ALTER DATABASE ... SET TABLESPACE X
are unclear too.

regards, tom lane




Re: block-level incremental backup

2019-04-22 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost  wrote:
> > > One could instead just do a straightforward extension
> > > to the existing BASE_BACKUP command to enable incremental backup.
> >
> > Ok, how do you envision that?  As I mentioned up-thread, I am concerned
> > that we're talking too high-level here and it's making the discussion
> > more difficult than it would be if we were to put together specific
> > ideas and then discuss them.
> >
> > One way I can imagine to extend BASE_BACKUP is by adding LSN as an
> > optional parameter and then having the database server scan the entire
> > cluster and send a tarball which contains essentially a 'diff' file of
> > some kind for each file where we can construct a diff based on the LSN,
> > and then the complete contents of the file for everything else that
> > needs to be in the backup.
> 
> /me scratches head.  Isn't that pretty much what I described in my
> original post?  I even described what that "'diff' file of some kind"
> would look like in some detail in the paragraph of that emailed
> numbered "2.", and I described the reasons for that choice at length
> in 
> http://postgr.es/m/ca+tgmozrqdv-tb8ny9p+1pqlqkxp5f1afghuohh5qt6ewdk...@mail.gmail.com
> 
> I can't figure out how I'm managing to be so unclear about things
> about which I thought I'd been rather explicit.

There was basically zero discussion about what things would look like at
a protocol level (I went back and skimmed over the thread before sending
my last email to specifically see if I was going to get this response
back..).  I get the idea behind the diff file, the contents of which I
wasn't getting into above.

> > So, sure, that would work, but it wouldn't be able to be parallelized
> > and I don't think it'd end up being very exciting for the external tools
> > because of that, but it would be fine for pg_basebackup.
> 
> Stop being such a pessimist.  Yes, if we only add the option to the
> BASE_BACKUP command, it won't directly be very exciting for external
> tools, but a lot of the work that is needed to do things that ARE
> exciting for external tools will have been done.  For instance, if the
> work to figure out which blocks have been modified via WAL-scanning
> gets done, and initially that's only exposed via BASE_BACKUP, it won't
> be much work for somebody to write code for a new code that exposes
> that information directly through some new replication command.
> There's a difference between something that's going in the wrong
> direction and something that's going in the right direction but not as
> far or as fast as you'd like.  And I'm 99% sure that everything I'm
> proposing here falls in the latter category rather than the former.

I didn't mean to imply that you're doing in the wrong direction here and
I thought I said somewhere in my last email more-or-less exactly the
same, that a great deal of the work needed for block-level incremental
backup would be done, but specifically that this proposal wouldn't allow
external tools to leverage that.  It sounds like what you're suggesting
now is that you're happy to implement the backend code, expose it in a
way that works just for pg_basebackup, and that if someone else wants to
add things to the protocol to make it easier for external tools to
leverage, great.  All I can say is that that's basically how we ended up
in the situation we're in today where pg_basebackup doesn't support
parallel backup but a bunch of external tools do and they don't go
through the backend to get there, even though they'd probably prefer to.

> > On the other hand, if you added new commands for 'list of files changed
> > since this LSN' and 'give me this file' and 'give me this file with the
> > changes in it since this LSN', then pg_basebackup could work with that
> > pretty easily in a single-threaded model (maybe with two connections to
> > the backend, but still in a single process, or maybe just by slurping up
> > the file list and then asking for each one) and the external tools could
> > leverage those new capabilities too for their backups, both full backups
> > and incremental ones.  This also wouldn't have to change how
> > pg_basebackup does full backups today one bit, so what we're really
> > talking about here is the direction to take the new code that's being
> > written, not about rewriting existing code.  I agree that it'd be a bit
> > more work...  but hopefully not *that* much more, and it would mean we
> > could later add parallel backup to pg_basebackup more easily too, if we
> > wanted to.
> 
> For purposes of implementing parallel pg_basebackup, it would probably
> be better if the server rather than the client decided which files to
> send via which connection.  If the client decides, then every time the
> server finishes sending a file, the client has to request another
> file, and that introduces some latency: after the server finishes
> sending each file, it has

Re: clean up docs for v12

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 13:18:18 -0400, Tom Lane wrote:
> Andres Freund  writes:
> >> (Possibly I'd not think this if I weren't fresh off a couple of days
> >> with my nose in the ALTER TABLE SET NOT NULL code.  But right now,
> >> I think that believing that that code does not and never will have
> >> any bugs is just damfool.)
> 
> > But there's plenty places where we rely on NOT NULL actually working?
> 
> I do not think there are any other places where we make this particular
> assumption.

Sure, not exactly the assumption that JITed deforming benefits from, but
as far as I can tell, plenty of things would be broken just as well if we
allowed NOT NULL columns to not be present (whether "physically" present
or present via atthasmissing) for tuples in a table. Fast defaults
wouldn't work, Assert(!isnull) checks would fire, primary keys would be
broken etc.



> In hopes of putting some fear into you too, I exhibit the following
> behavior, which is not a bug according to our current definitions:
> 
> regression=# create table pp(f1 int);
> CREATE TABLE
> regression=# create table cc() inherits (pp);
> CREATE TABLE
> regression=# insert into cc values(1);
> INSERT 0 1
> regression=# insert into cc values(2);
> INSERT 0 1
> regression=# insert into cc values(null);
> INSERT 0 1
> regression=# alter table pp add column f2 text;
> ALTER TABLE
> regression=# alter table pp add column f3 text;
> ALTER TABLE
> regression=# alter table only pp alter f3 set not null;
> ALTER TABLE
> regression=# select * from pp;
>  f1 | f2 | f3 
> ----+----+----
>   1 |    | 
>   2 |    | 
>     |    | 
> (3 rows)
> 
> The tuples coming out of cc will still have natts = 1, I believe.
> If they were deformed according to pp's tupdesc, there'd be a
> problem.  Now, we shouldn't do that, because this is not the only
> possible discrepancy between parent and child tupdescs --- but
> I think this example shows that attnotnull is a lot spongier
> than you are assuming, even without considering the possibility
> of outright bugs.

Unfortunately it doesn't really put the fear into me - given that
attribute numbers don't even have to match between inheritance children,
making inferences about the length of the NULL bitmap seems peanuts
compared to the breakage of using the wrong tupdesc to deform.

Greetings,

Andres Freund




Re: clean up docs for v12

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 14:17:48 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-04-22 13:27:17 -0400, Tom Lane wrote:
> >> I wonder
> >> also if it wouldn't be smart to explicitly check that the "guaranteeing"
> >> column is not attisdropped.
> 
> > Yea, that probably would be smart.  I don't think there's an active
> > problem, because we remove NOT NULL when deleting an attribute, but it
> > seems good to be doubly sure / explain why that's safe:
> > /* Remove any NOT NULL constraint the column may have */
> > attStruct->attnotnull = false;
> > I'm a bit unsure whether to make it an assert, elog(ERROR) or just not
> > assume column presence?
> 
> I'd just make the code look like
> 
> /*
>  * If it's NOT NULL then it must be present in every tuple,
>  * unless there's a "missing" entry that could provide a non-null
>  * value for it.  Out of paranoia, also check !attisdropped.
>  */
> if (att->attnotnull &&
> !att->atthasmissing &&
> !att->attisdropped)
> guaranteed_column_number = attnum;
> 
> I don't think the extra check is so expensive as to be worth obsessing
> over.

Oh, yea, the cost is irrelevant here - it's one-off work basically, and
pales in comparison to the cost of JITing. I was more thinking about
whether it's worth "escalating" the violation of assumptions.

Greetings,

Andres Freund




Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 14:16:28 -0400, Alvaro Herrera wrote:
> On 2019-Apr-22, Robert Haas wrote:
> 
> > PostgreSQL has historically and very deliberately *not made a
> > distinction* between "this object is in the default tablespace" and
> > "this object is in tablespace X which happens to be the default."  I
> > think that it's too late to invent such a distinction for reasons of
> > backward compatibility -- and if we were going to do it, surely it
> > would need to exist for both partitioned tables and the partitions
> > themselves. Otherwise it just produces more strange inconsistencies.
> 
> Yeah, this is probably right.  (I don't think it's the same thing that
> Tom was saying, though, or at least I didn't understand his argument
> this way.)
> 
> I think we can get out of this whole class of problems by forbidding the
> TABLESPACE clause for partitioned rels from mentioning the database
> tablespace -- that is, users either mention some *other* tablespace, or
> partitions follow default_tablespace like everybody else.  AFAICS with
> that restriction this whole problem does not arise, and the patch may
> become simpler.  I'll give it a spin.

Why is the obvious answer not to just remove the whole tablespace
inheritance behaviour? It's obviously ambiguous and hard to get right.
I still don't see any usecase that even comes close to making the
inheritance useful enough to justify the amount of work (code, tests,
bugfixes) and docs that are required.

Greetings,

Andres Freund




Re: clean up docs for v12

2019-04-22 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-22 13:27:17 -0400, Tom Lane wrote:
>> I wonder
>> also if it wouldn't be smart to explicitly check that the "guaranteeing"
>> column is not attisdropped.

> Yea, that probably would be smart.  I don't think there's an active
> problem, because we remove NOT NULL when deleting an attribute, but it
> seems good to be doubly sure / explain why that's safe:
>   /* Remove any NOT NULL constraint the column may have */
>   attStruct->attnotnull = false;
> I'm a bit unsure whether to make it an assert, elog(ERROR) or just not
> assume column presence?

I'd just make the code look like

/*
 * If it's NOT NULL then it must be present in every tuple,
 * unless there's a "missing" entry that could provide a non-null
 * value for it.  Out of paranoia, also check !attisdropped.
 */
if (att->attnotnull &&
!att->atthasmissing &&
!att->attisdropped)
guaranteed_column_number = attnum;

I don't think the extra check is so expensive as to be worth obsessing
over.

regards, tom lane




Re: pg_dump is broken for partition tablespaces

2019-04-22 Thread Alvaro Herrera
On 2019-Apr-22, Robert Haas wrote:

> PostgreSQL has historically and very deliberately *not made a
> distinction* between "this object is in the default tablespace" and
> "this object is in tablespace X which happens to be the default."  I
> think that it's too late to invent such a distinction for reasons of
> backward compatibility -- and if we were going to do it, surely it
> would need to exist for both partitioned tables and the partitions
> themselves. Otherwise it just produces more strange inconsistencies.

Yeah, this is probably right.  (I don't think it's the same thing that
Tom was saying, though, or at least I didn't understand his argument
this way.)

I think we can get out of this whole class of problems by forbidding the
TABLESPACE clause for partitioned rels from mentioning the database
tablespace -- that is, users either mention some *other* tablespace, or
partitions follow default_tablespace like everybody else.  AFAICS with
that restriction this whole problem does not arise, and the patch may
become simpler.  I'll give it a spin.
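
As a sketch of how that restriction would look from the user's side (this is
the proposal, not committed behaviour; tablespace names are illustrative and
the database's tablespace is assumed to be pg_default):

CREATE TABLE m1 (ts timestamptz, v int) PARTITION BY RANGE (ts)
    TABLESPACE spc_other;    -- allowed: names some *other* tablespace
CREATE TABLE m2 (ts timestamptz, v int) PARTITION BY RANGE (ts)
    TABLESPACE pg_default;   -- would be rejected under the proposal
CREATE TABLE m3 (ts timestamptz, v int) PARTITION BY RANGE (ts);
    -- no clause: partitions follow default_tablespace like everybody else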

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Do CustomScan as not projection capable node

2019-04-22 Thread Tom Lane
Andrey Lepikhov  writes:
> It is possible that custom_scan_tlist is designed too nontrivially, and 
> it is possible that it needs some comments describing in more detail how 
> to use it.

I totally buy the argument that the custom scan stuff is
underdocumented :-(.

FWIW, if we did have a use-case showing that somebody would like to make
custom scans that can't project, the way to do it would be to add a flag
bit showing whether a particular CustomPath/CustomScan could project or
not.  Not to assume that they all can't.  This wouldn't be that much code
really, but I'd still like to see a plausible use-case before adding it,
because it'd be a small API break for existing CustomPath providers.

regards, tom lane




Re: block-level incremental backup

2019-04-22 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-04-19 20:04:41 -0400, Stephen Frost wrote:
> > I agree that we don't want another implementation and that there's a lot
> > that we want to do to improve replay performance.  We've already got
> > frontend tools which work with multiple execution threads, so I'm not
> > sure I get the "not easily feasible" bit, and the argument about the
> > checkpointer seems largely related to that (as in- if we didn't have
> > multiple threads/processes then things would perform quite badly...  but
> > we can and do have multiple threads/processes in frontend tools today,
> > even in pg_basebackup).
> 
> You need not just multiple execution threads, but basically a new
> implementation of shared buffers, locking, process monitoring, with most
> of the related infrastructure. You're literally talking about
> reimplementing a very substantial portion of the backend.  I'm not sure
> I can transport in written words - via a public medium - how bad an idea
> it would be to go there.

Yes, there'd be some need for locking and process monitoring, though if
we aren't supporting ongoing read queries at the same time, there's a
whole bunch of things that we don't need from the existing backend.

> > > Which I think is entirely reasonable. With the 'consistent' and LSN
> > > recovery targets one already can get most of what's needed from such a
> > > tool, anyway.  I'd argue the biggest issue there is that there's no
> > > equivalent to starting postgres with a private socket directory on
> > > windows, and perhaps an option or two making it easier to start postgres
> > > in a "private" mode for things like this.
> > 
> > This would mean building in a way to do parallel WAL replay into the
> > server binary though, as discussed above, and it seems like making that
> > work in a way that allows us to still be available as a read-only
> > standby would be quite a bit more difficult.  We could possibly support
> > parallel WAL replay only when we aren't a replica but from the same
> > binary.
> 
> I'm doubtful that we should try to implement parallel WAL apply that
> can't support HS - a substantial portion of the the logic to avoid
> issues around relfilenode reuse, consistency etc is going to be to be
> necessary for non-HS aware apply anyway.  But if somebody had a concrete
> proposal for something that's fundamentally only doable without HS, I
> could be convinced.

I'd certainly prefer that we support parallel WAL replay *with* HS, that
just seems like a much larger problem, but I'd be quite happy to be told
that it wouldn't be that much harder.

> > A lot of this part of the discussion feels like a tangent though, unless
> > I'm missing something.
> 
> I'm replying to:
> 
> On 2019-04-17 18:43:10 -0400, Stephen Frost wrote:
> > Wow.  I have to admit that I feel completely opposite of that- I'd
> > *love* to have an independent tool (which ideally uses the same code
> > through the common library, or similar) that can be run to apply WAL.
> 
> And I'm basically saying that anything that starts from this premise is
> fatally flawed (in the ex falso quodlibet kind of sense ;)).

I'd just say that it'd be... difficult. :)

> > The "WAL compression" tool contemplated
> > previously would be much simpler and not the full-blown WAL replay
> > capability, which would be left to the server, unless you're suggesting
> > that even that should be exclusively the purview of the backend?  Though
> > that ship's already sailed, given that external projects have
> > implemented it.
> 
> I'm extremely doubtful of such tools (but it's not what I was responding
> to, see above). I'd be extremely surprised if even one of them came
> close to being correct. The old FPI removal tool had data corrupting
> bugs left and right.

I have concerns about it myself, which is why I'd actually really like
to see something in core that does it, and does it the right way, that
other projects could then leverage (ideally by just linking into the
library without having to rewrite what's in core, though that might not
be an option for things like WAL-G that are in Go and possibly don't
want to link in some C library).

> > Having a library to provide that which external
> > projects could leverage would be nicer than having everyone write their
> > own version.
> 
> No, I don't think that's necessarily true. Something complicated that's
> hard to get right doesn't have to be provided by core. Even if other
> projects decide that their risk/reward assessment is different than core
> postgres'. We don't have to take on all kind of work and complexity for
> external tools.

No, it doesn't have to be provided by core, but I sure would like it to
be and I'd be much more comfortable if it was because then we'd also
take care to not break whatever assumptions are made (or to do so in a
way that can be detected and/or handled) as new code is written.  As
discussed above, as long as it isn't provided by core

Re: Patch: doc for pg_logical_emit_message()

2019-04-22 Thread Fujii Masao
On Fri, Apr 19, 2019 at 3:21 PM Matsumura, Ryo
 wrote:
>
> Hi, Hackers
>
> pg_logical_emit_message() can be used by any user,
> but the following document says that it can be used by only superuser.
>
> > Table 9.88. Replication SQL Functions
> > Use of these functions is restricted to superusers.
>
> I think that pg_logical_emit_message() should be used by any user.
> Therefore, I attach the document patch.

Thanks for the patch!

Use of not only pg_logical_emit_message() but also other replication
functions in Table 9.88 is not restricted to superusers. For example,
a replication role can execute pg_create_physical_replication_slot().
So I think that the patch should also fix the description of those
replication functions. Thought?
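
For illustration, the current behaviour is easy to see with an ordinary role
(the role name is made up; pg_logical_emit_message(transactional, prefix,
content) is the existing signature):

CREATE ROLE plain_user LOGIN;   -- no SUPERUSER, no REPLICATION
SET ROLE plain_user;
SELECT pg_logical_emit_message(true, 'test-prefix', 'emitted by a non-superuser');
RESET ROLE;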

Regards,

-- 
Fujii Masao




Re: clean up docs for v12

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 13:27:17 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > The computation of that variable above has:
> 
> >  * If the column is possibly missing, we can't rely on its (or
> >  * subsequent) NOT NULL constraints to indicate minimum 
> > attributes in
> >  * the tuple, so stop here.
> >  */
> > if (att->atthasmissing)
> > break;
> 
> BTW, why do we have to stop?  ISTM that a not-null column without
> atthasmissing is enough to prove this, regardless of the state of prior
> columns.  (This is assuming that you trust attnotnull for this, which
> as I said I don't, but that's not relevant to this question.)

Are you wondering if we could also use this kind of logic to infer the
length of the null bitmap if there are preceding columns with
atthasmissing true as long as there's a later !hasmissing column that's
NOT NULL?  Right. The logic could be made more powerful - I implemented
the above after Andrew's commit of fast-not-null broke JIT (not because
of that logic, but because it simply didn't look up the missing
columns).  I assume it doesn't terribly matter to be fast once
attributes after a previously missing one are accessed - it's likely not
going to be the hotly accessed data?


> I wonder
> also if it wouldn't be smart to explicitly check that the "guaranteeing"
> column is not attisdropped.

Yea, that probably would be smart.  I don't think there's an active
problem, because we remove NOT NULL when deleting an attribute, but it
seems good to be doubly sure / explain why that's safe:

/* Remove any NOT NULL constraint the column may have */
attStruct->attnotnull = false;

I'm a bit unsure whether to make it an assert, elog(ERROR) or just not
assume column presence?

Greetings,

Andres Freund




Re: Allow any[] as input arguments for sql/plpgsql functions to mimic format()

2019-04-22 Thread Pavel Stehule
Hi

On Mon 22. 4. 2019 at 19:20, Michał "phoe" Herda wrote:

> Hey!
>
> OK - thank you for the update and the explanation.
>
> My reasoning in this case is - if we allow the any[] type to only be
> passed to other functions that accept any[], and disallow any kind of other
> operations on this array (such as retrieving its elements or modifying it),
> I do not yet see any places where it might introduce a performance
> regression. These arguments will literally be pass-only, and since we are
> unable to interact with them in any other way, there will be no possibility
> of type mismatches and therefore for performance penalties.
>
> This approach puts all the heavy work on the plpgsql compiler - it will
> need to ensure that, if there is a any[] or VARIADIC any variable in a
> function arglist, it must NOT be accessed in any way, and can only be
> passed to other functions which accept any[] or VARIADIC any.
>
The PLpgSQL compiler knows nothing about expressions - the compiler processes
only plpgsql statements. Expressions are processed at runtime, only by the
SQL parser and executor.

It is good to start with the plpgsql code -
https://github.com/postgres/postgres/tree/master/src/pl/plpgsql/src

You can see there that plpgsql is very different from other compilers. It
is just glue around SQL expressions or queries, which are a black box for
the PLpgSQL compiler and executor.

Just any[] is not the plpgsql way. For your case you should use overloading

create or replace function fx(fmt text, par text)
returns void as $$
begin
  raise notice '%', format(fmt, par);
end;
$$ language plpgsql;

create or replace function fx(fmt text, par numeric)
returns void as $$
begin
  raise notice '%', format(fmt, par);
end;
$$ language plpgsql;
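
A possible usage sketch for the two overloads above (the %s placeholder is
only an assumption about the format string being used):

select fx('found %s', 'some text');   -- resolves to fx(text, text)
select fx('found %s', 42);            -- integer casts implicitly to numeric,
                                      -- so this resolves to fx(text, numeric)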

There is another limit: you cannot declare a function parameter type that
enforces explicit casting.

It could be nice (but it is a strange idea) to have some other flags for arguments

CREATE OR REPLACE FUNCTION gateway_error(fmt text, par text FORCE EXPLICIT
CAST)
...

Regards

Pavel


> BR
> ~phoe
> On 22.04.2019 12:09, Pavel Stehule wrote:
>
> Hi
>
> On Mon 22. 4. 2019 at 11:27, Michał "phoe" Herda wrote:
>
>> Hey everyone,
>>
>> I am writing a plpgsql function that (to greatly simplify) raises an
>> exception with a formatted* message. Ideally, I should be able to call
>> it with raise_exception('The person %I has only %I bananas.', 'Fred',
>> 8), which mimics the format(text, any[]) calling convention.
>>
>> Here is where I have encountered a limitation of PostgreSQL's design:
>> https://www.postgresql.org/docs/11/datatype-pseudo.html mentions
>> explicitly that, "At present most procedural languages forbid use of a
>> pseudo-type as an argument type".
>>
>> My reasoning is that I should be able to accept a value of some type if
>> all I do is passing it to a function that accepts exactly that type,
>> such as format(text, any[]). Given the technical reality, I assume that
>> I wouldn't be able to do anything else with that value, but that is
>> fine, since I don't have to do anything with it regardless.
>>
>> BR
>> Michał "phoe" Herda
>>
>> *I do not want to use the obvious solution of
>> raise_exception(format(...)) because the argument to that function is
>> the error ID that is then looked up in a table from which the error
>> message and sqlstate are retrieved. My full code is in the attached SQL
>> file. Once it is executed:
>>
>> SELECT gateway_error('user_does_not_exist', '2'); -- works but is
>> unnatural,
>> SELECT gateway_error('user_does_not_exist', 2); -- is natural but
>> doesn't work.
>>
>
> It is known problem, and fix is not easy.
>
> Any expressions inside plpgsql are simple queries like SELECT expr, and
> they are executed same pipeline like queries.
>
> The plans of these queries are stored and reused. Originally these plans
> disallow any changes, now some changes are supported, but parameters should
> be same all time. This is ensured by disallowing "any" type.
>
> Other polymorphic types are very static, so there is not described risk.
>
> Probably some enhancement can be in this are. The plan can be re-planed
> after some change - but it can has lot of performance impacts. It is long
> open topic. Some changes in direction to dynamic languages can increase
> cost of  some future optimization to higher performance :-(.
>
> Regards
>
> Pavel
>
>
>
>
>
>


Re: [PATCH v1] Add \echo_stderr to psql

2019-04-22 Thread Fabien COELHO




   \warn ...
   \warning ...


These two seem about the best to me, drawing from the perl warn command.


Yep, I was thinking of perl & gmake. Maybe the 4 letter option is better
because it's the same length as "echo".


I suppose we could go the bash &2 route here, but I don't want to.


I agree on this one.


Please find attached v2, name is now \warn.

How might we test this portably?


TAP testing? See pgbench, which has a TAP test that can check stdout & stderr
by calling the utility command_checks_all; the same could be done with psql.


--
Fabien.




Re: Allow any[] as input arguments for sql/plpgsql functions to mimic format()

2019-04-22 Thread phoe
Hey!

OK - thank you for the update and the explanation.

My reasoning in this case is - if we allow the any[] type to only be
passed to other functions that accept any[], and disallow any kind of
other operations on this array (such as retrieving its elements or
modifying it), I do not yet see any places where it might introduce a
performance regression. These arguments will literally be pass-only, and
since we are unable to interact with them in any other way, there will
be no possibility of type mismatches and therefore no performance
penalties.

This approach puts all the heavy work on the plpgsql compiler - it will
need to ensure that, if there is a any[] or VARIADIC any variable in a
function arglist, it must NOT be accessed in any way, and can only be
passed to other functions which accept any[] or VARIADIC any.

BR
~phoe

On 22.04.2019 12:09, Pavel Stehule wrote:
> Hi
>
> On Mon 22. 4. 2019 at 11:27, Michał "phoe" Herda  wrote:
>
> Hey everyone,
>
> I am writing a plpgsql function that (to greatly simplify) raises an
> exception with a formatted* message. Ideally, I should be able to call
> it with raise_exception('The person %I has only %I bananas.', 'Fred',
> 8), which mimics the format(text, any[]) calling convention.
>
> Here is where I have encountered a limitation of PostgreSQL's design:
> https://www.postgresql.org/docs/11/datatype-pseudo.html mentions
> explicitly that, "At present most procedural languages forbid use of a
> pseudo-type as an argument type".
>
> My reasoning is that I should be able to accept a value of some
> type if
> all I do is passing it to a function that accepts exactly that type,
> such as format(text, any[]). Given the technical reality, I assume
> that
> I wouldn't be able to do anything else with that value, but that is
> fine, since I don't have to do anything with it regardless.
>
> BR
> Michał "phoe" Herda
>
> *I do not want to use the obvious solution of
> raise_exception(format(...)) because the argument to that function is
> the error ID that is then looked up in a table from which the error
> message and sqlstate are retrieved. My full code is in the
> attached SQL
> file. Once it is executed:
>
> SELECT gateway_error('user_does_not_exist', '2'); -- works but is
> unnatural,
> SELECT gateway_error('user_does_not_exist', 2); -- is natural but
> doesn't work.
>
>
> It is known problem, and fix is not easy.
>
> Any expressions inside plpgsql are simple queries like SELECT expr,
> and they are executed same pipeline like queries.
>
> The plans of these queries are stored and reused. Originally these
> plans disallow any changes, now some changes are supported, but
> parameters should be same all time. This is ensured by disallowing
> "any" type.
>
> Other polymorphic types are very static, so there is not described risk.
>
> Probably some enhancement can be in this are. The plan can be
> re-planed after some change - but it can has lot of performance
> impacts. It is long open topic. Some changes in direction to dynamic
> languages can increase cost of  some future optimization to higher
> performance :-(.
>
> Regards
>
> Pavel
>
>
>
>  


Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation

2019-04-22 Thread Peter Geoghegan
On Mon, Apr 22, 2019 at 10:32 AM Stephen Frost  wrote:
> Yes, global indexes for partitioned tables could potentially be simpler
> than the logical row identifiers, but maybe it'd be useful to just have
> one implementation based around logical row identifiers which ends up
> working for global indexes as well as the other types of indexes and
> table access methods.

Maybe so. I think that we'd have to actually try it out to know for sure.

> If we thought that the 'single-byte' partition number covered enough
> use-cases then we could possibly consider supporting them for partitions
> by just 'stealing' a byte from BlockIdData and having the per-partition
> size be limited to 4TB when a global index exists on the partitioned
> table.

I don't think that that would make it any easier to implement.

> This sounds pretty good to me, though I'm not nearly as close to the
> code there as you are.

I'm slightly concerned that I may have broken an index_form_tuple()
caller from some other access method, but they all seem to not trust
index_form_tuple() to have MAXALIGN()'d on their behalf, just like
nbtree (nbtree won't when I'm done, though, because it will actively
try to preserve the "real" tuple size). It's convenient to me that no
caller seems to rely on the index_form_tuple() MAXALIGN() that I want
to remove.

-- 
Peter Geoghegan




Re: block-level incremental backup

2019-04-22 Thread Robert Haas
On Mon, Apr 22, 2019 at 1:08 PM Stephen Frost  wrote:
> > I think we're getting closer to a meeting of the minds here, but I
> > don't think it's intrinsically necessary to rewrite the whole method
> > of operation of pg_basebackup to implement incremental backup in a
> > sensible way.
>
> It wasn't my intent to imply that the whole method of operation of
> pg_basebackup would have to change for this.

Cool.

> > One could instead just do a straightforward extension
> > to the existing BASE_BACKUP command to enable incremental backup.
>
> Ok, how do you envision that?  As I mentioned up-thread, I am concerned
> that we're talking too high-level here and it's making the discussion
> more difficult than it would be if we were to put together specific
> ideas and then discuss them.
>
> One way I can imagine to extend BASE_BACKUP is by adding LSN as an
> optional parameter and then having the database server scan the entire
> cluster and send a tarball which contains essentially a 'diff' file of
> some kind for each file where we can construct a diff based on the LSN,
> and then the complete contents of the file for everything else that
> needs to be in the backup.

/me scratches head.  Isn't that pretty much what I described in my
original post?  I even described what that "'diff' file of some kind"
would look like in some detail in the paragraph of that emailed
numbered "2.", and I described the reasons for that choice at length
in 
http://postgr.es/m/ca+tgmozrqdv-tb8ny9p+1pqlqkxp5f1afghuohh5qt6ewdk...@mail.gmail.com

I can't figure out how I'm managing to be so unclear about things
about which I thought I'd been rather explicit.

> So, sure, that would work, but it wouldn't be able to be parallelized
> and I don't think it'd end up being very exciting for the external tools
> because of that, but it would be fine for pg_basebackup.

Stop being such a pessimist.  Yes, if we only add the option to the
BASE_BACKUP command, it won't directly be very exciting for external
tools, but a lot of the work that is needed to do things that ARE
exciting for external tools will have been done.  For instance, if the
work to figure out which blocks have been modified via WAL-scanning
gets done, and initially that's only exposed via BASE_BACKUP, it won't
be much work for somebody to write code for a new code that exposes
that information directly through some new replication command.
There's a difference between something that's going in the wrong
direction and something that's going in the right direction but not as
far or as fast as you'd like.  And I'm 99% sure that everything I'm
proposing here falls in the latter category rather than the former.

> On the other hand, if you added new commands for 'list of files changed
> since this LSN' and 'give me this file' and 'give me this file with the
> changes in it since this LSN', then pg_basebackup could work with that
> pretty easily in a single-threaded model (maybe with two connections to
> the backend, but still in a single process, or maybe just by slurping up
> the file list and then asking for each one) and the external tools could
> leverage those new capabilities too for their backups, both full backups
> and incremental ones.  This also wouldn't have to change how
> pg_basebackup does full backups today one bit, so what we're really
> talking about here is the direction to take the new code that's being
> written, not about rewriting existing code.  I agree that it'd be a bit
> more work...  but hopefully not *that* much more, and it would mean we
> could later add parallel backup to pg_basebackup more easily too, if we
> wanted to.

For purposes of implementing parallel pg_basebackup, it would probably
be better if the server rather than the client decided which files to
send via which connection.  If the client decides, then every time the
server finishes sending a file, the client has to request another
file, and that introduces some latency: after the server finishes
sending each file, it has to wait for the client to finish receiving
the data, and it has to wait for the client to tell it what file to
send next.  If the server decides, then it can just send data at top
speed without a break.  So the ideal interface for pg_basebackup would
really be something like:

START_PARALLEL_BACKUP blah blah PARTICIPANTS 4;

...returning a cookie that can be then be used by each participant for
an argument to a new commands:

JOIN_PARALLEL_BACKUP 'cookie';

However, that is obviously extremely inconvenient for third-party
tools.  It's possible we need both an interface like this -- for use
by parallel pg_basebackup -- and a
START_BACKUP/SEND_FILE_LIST/SEND_FILE_CONTENTS/STOP_BACKUP type
interface for use by external tools.  On the other hand, maybe the
additional overhead caused by managing the list of files to be fetched
on the client side is negligible.  It'd be interesting to see, though,
how busy the server is when running an incremental backup managed by
an external 

Re: Trouble with FETCH_COUNT and combined queries in psql

2019-04-22 Thread Fabien COELHO



Bonjour Daniel,


When FETCH_COUNT is set, queries combined in a single request don't work
as expected:

\set FETCH_COUNT 10
select pg_sleep(2) \; select 1;

No result is displayed, the pg_sleep(2) is not run, and no error
is shown. That's disconcerting.


Indeed.


Does anyone have thoughts about how to fix this?



ATM I don't see a plausible fix that does not involve the parser
to store the information that it's a multiple-query command and pass
it down somehow to is_select_command().


The lexer (not parser) is called by psql to know where the query stops 
(i.e. waiting for ";"), so it could indeed know whether there are several 
queries.


I added some stuff to extract embedded "\;" for pgbench "\cset", which has 
been removed though, but it is easy to add back a detection of "\;", and 
also to detect select. If the position of the last select is known, the 
cursor can be declared in the right place, which would also solve the 
problem.


Or a more modern approach could be to give up on the cursor-based method 
in favor of PQsetSingleRowMode().


Hmmm. I'm not sure that row count is available under this mode? ISTM that 
the FETCH_COUNT stuff should really batch fetching result by this amount.

I'm not sure of the 1 by 1 row approach.
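
For context, the cursor-based method under discussion looks roughly like the
following for a single SELECT (the cursor name is illustrative); a combined
"q1 \; q2" buffer cannot simply be wrapped in one DECLARE like this, which is
where the trouble starts:

BEGIN;
DECLARE _psql_cursor NO SCROLL CURSOR FOR
    SELECT g FROM generate_series(1, 100) AS g;
FETCH FORWARD 10 FROM _psql_cursor;   -- repeated until exhausted; 10 = FETCH_COUNT
CLOSE _psql_cursor;
COMMIT;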

--
Fabien.




Re: block-level incremental backup

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-19 20:04:41 -0400, Stephen Frost wrote:
> I agree that we don't want another implementation and that there's a lot
> that we want to do to improve replay performance.  We've already got
> frontend tools which work with multiple execution threads, so I'm not
> sure I get the "not easily feasible" bit, and the argument about the
> checkpointer seems largely related to that (as in- if we didn't have
> multiple threads/processes then things would perform quite badly...  but
> we can and do have multiple threads/processes in frontend tools today,
> even in pg_basebackup).

You need not just multiple execution threads, but basically a new
implementation of shared buffers, locking, process monitoring, with most
of the related infrastructure. You're literally talking about
reimplementing a very substantial portion of the backend.  I'm not sure
I can transport in written words - via a public medium - how bad an idea
it would be to go there.


> You certainly bring up some good concerns though and they make me think
> of other bits that would seem like they'd possibly be larger issues for
> a frontend tool- like having a large pool of memory for cacheing (aka
> shared buffers) the changes.  If what we're talking about here is *just*
> replay though, without having the system available for reads, I wonder
> if we might want a different solution there.

No.


> > Which I think is entirely reasonable. With the 'consistent' and LSN
> > recovery targets one already can get most of what's needed from such a
> > tool, anyway.  I'd argue the biggest issue there is that there's no
> > equivalent to starting postgres with a private socket directory on
> > windows, and perhaps an option or two making it easier to start postgres
> > in a "private" mode for things like this.
> 
> This would mean building in a way to do parallel WAL replay into the
> server binary though, as discussed above, and it seems like making that
> work in a way that allows us to still be available as a read-only
> standby would be quite a bit more difficult.  We could possibly support
> parallel WAL replay only when we aren't a replica but from the same
> binary.

I'm doubtful that we should try to implement parallel WAL apply that
can't support HS - a substantial portion of the logic to avoid
issues around relfilenode reuse, consistency etc is going to be
necessary for non-HS aware apply anyway.  But if somebody had a concrete
proposal for something that's fundamentally only doable without HS, I
could be convinced.


> The concerns mentioned about making it easier to start PG in a
> private mode don't seem too bad but I am not entirely sure that the
> tools which want to leverage that kind of capability would want to have
> to exec out to the PG binary to use it.

Tough luck.  But even leaving infeasibility aside, it seems like a quite
bad idea to do this in-process inside a tool that manages backup &
recovery. Creating threads / sub-processes with complicated needs (like
any pared down version of pg to do just recovery would have) from within
a library has substantial complications. So you'd not want to do this
in-process anyway.


> A lot of this part of the discussion feels like a tangent though, unless
> I'm missing something.

I'm replying to:

On 2019-04-17 18:43:10 -0400, Stephen Frost wrote:
> Wow.  I have to admit that I feel completely opposite of that- I'd
> *love* to have an independent tool (which ideally uses the same code
> through the common library, or similar) that can be run to apply WAL.

And I'm basically saying that anything that starts from this premise is
fatally flawed (in the ex falso quodlibet kind of sense ;)).


> The "WAL compression" tool contemplated
> previously would be much simpler and not the full-blown WAL replay
> capability, which would be left to the server, unless you're suggesting
> that even that should be exclusively the purview of the backend?  Though
> that ship's already sailed, given that external projects have
> implemented it.

I'm extremely doubtful of such tools (but it's not what I was responding
to, see above). I'd be extremely surprised if even one of them came
close to being correct. The old FPI removal tool had data corrupting
bugs left and right.


> Having a library to provide that which external
> projects could leverage would be nicer than having everyone write their
> own version.

No, I don't think that's necessarily true. Something complicated that's
hard to get right doesn't have to be provided by core. Even if other
projects decide that their risk/reward assessment is different than core
postgres'. We don't have to take on all kind of work and complexity for
external tools.

Greetings,

Andres Freund




Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation

2019-04-22 Thread Stephen Frost
Greetings,

* Peter Geoghegan (p...@bowt.ie) wrote:
> On Mon, Apr 22, 2019 at 8:36 AM Stephen Frost  wrote:
> > This seems like it would be helpful for global indexes as well, wouldn't
> > it?
> 
> Yes, though that should probably work by reusing what we already do
> with heap TID (use standard IndexTuple fields on the leaf level for
> heap TID), plus an additional identifier for the partition number that
> is located at the physical end of the tuple. IOW, I think that this
> might benefit from a design that is half way between what we already
> do with heap TIDs and what we would be required to do to make varwidth
> logical row identifiers in tables work -- the partition number is
> varwidth, though often only a single byte.

Yes, global indexes for partitioned tables could potentially be simpler
than the logical row identifiers, but maybe it'd be useful to just have
one implementation based around logical row identifiers which ends up
working for global indexes as well as the other types of indexes and
table access methods.

If we thought that the 'single-byte' partition number covered enough
use-cases then we could possibly consider supporting them for partitions
by just 'stealing' a byte from BlockIdData and having the per-partition
size be limited to 4TB when a global index exists on the partitioned
table.  That's certainly not an ideal limitation but it might be
appealing to some users who really would like global indexes and could
possibly require less to implement, though there's a lot of other things
that would have to be done to have global indexes.  Anyhow, just some
random thoughts that I figured I'd share in case there might be
something there worth thinking about.

> > I agree with trying to avoid having padding 'in the wrong place' and if
> > it makes some indexes smaller, great, even if they're unlikely to be
> > interesting in the vast majority of cases, they may still exist out
> > there.  Of course, this is provided that it doesn't overly complicate
> > the code, but it sounds like it wouldn't be too bad in this case.
> 
> Here is what it took:
> 
> * Removed the "conservative" MAXALIGN() within index_form_tuple(),
> bringing it in line with heap_form_tuple(), which only MAXALIGN()s so
> that the first attribute in tuple's data area can safely be accessed
> on alignment-picky platforms, but doesn't do the same with data_len.
> 
> * Removed most of the MAXALIGN()s from nbtinsert.c, except one that
> considers if a page split is required.
> 
> * Didn't change the nbtsplitloc.c code, because we need to assume
> MAXALIGN()'d space quantities there. We continue to not trust the
> reported tuple length to be MAXALIGN()'d, which is now essential
> rather than just defensive.
> 
> * Removed MAXALIGN()s within _bt_truncate(), and SHORTALIGN()'d the
> whole tuple size in the case where new pivot tuple requires a heap TID
> representation. We access TIDs as 3 2 byte integers, so this is
> necessary for alignment-picky platforms.
> 
> I will pursue this as a project for PostgreSQL 13. It doesn't affect
> on-disk compatibility, because BTreeTupleGetHeapTID() works just as
> well with either the existing scheme, or this new one. Having the
> "real" tuple length available will make it easier to implement "true"
> suffix truncation, where we truncate *within* a text attribute (i.e.
> generate a new, shorter value using new opclass infrastructure).

This sounds pretty good to me, though I'm not nearly as close to the
code there as you are.

Thanks!

Stephen




Re: clean up docs for v12

2019-04-22 Thread Tom Lane
Andres Freund  writes:
> The computation of that variable above has:

>* If the column is possibly missing, we can't rely on its (or
>* subsequent) NOT NULL constraints to indicate minimum 
> attributes in
>* the tuple, so stop here.
>*/
>   if (att->atthasmissing)
>   break;

BTW, why do we have to stop?  ISTM that a not-null column without
atthasmissing is enough to prove this, regardless of the state of prior
columns.  (This is assuming that you trust attnotnull for this, which
as I said I don't, but that's not relevant to this question.)  I wonder
also if it wouldn't be smart to explicitly check that the "guaranteeing"
column is not attisdropped.

> I think just reformulating that to something like

>   /*
>* Check if it's guaranteed that all the desired attributes are 
> available
>* in the tuple (but still possibly NULL), by dint of either the last
>* to-be-deformed column being NOT NULL, or subsequent ones not accessed
>* here being NOT NULL.  If that's not guaranteed the tuple headers 
> natt's
>* has to be checked, and missing attributes potentially have to be
>* fetched (using slot_getmissingattrs().
>   */

> should make that clearer?

OK by me.

regards, tom lane




Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation

2019-04-22 Thread Peter Geoghegan
On Mon, Apr 22, 2019 at 9:35 AM Andres Freund  wrote:
> I, more generally, wonder if there's not a case to squeeze out more
> padding than "just" what you describe (since we IIRC don't frequently
> keep pointers into such tuples anyway, and definitely don't for byval
> attrs). But that's very likely better done separately.

There is one way that that is kind of relevant here. The current
requirement seems to be that *any* sort of tuple header be
MAXALIGN()'d, because in the worst case the first attribute needs to
be accessed at a MAXALIGN()'d boundary on an alignment-picky platform.
That isn't so bad today, since we usually find a reasonably good way
to put those 8 bytes (or 23/24 bytes in the case of heap tuples) to
use. However, with varwidth table identifiers, the only use of those 8
bytes that I can think of is the offset to the identifier (or perhaps
its length), plus the usual t_info stuff. We'd almost invariably waste
4 or 5 bytes, which seems like a problem to me.

-- 
Peter Geoghegan




Re: Do CustomScan as not projection capable node

2019-04-22 Thread Andrey Lepikhov




On 22/04/2019 18:40, Robert Haas wrote:

On Fri, Apr 19, 2019 at 12:45 AM Tom Lane  wrote:

I don't buy this for a minute.  Where do you think projection is
going to happen?  There isn't any existing node type that *couldn't*
support projection if we insisted that that be done across-the-board.
I think it's mostly just a legacy thing that some don't.


I think there may actually be some good reasons for that.  If
something like an Append or Material node projects, it seems to me
that this means that we built the wrong tlist for its input(s).

That justification doesn't apply to custom scans, though.
The main reason for my question was incomplete information about the 
parameter custom_scan_tlist / fdw_scan_tlist.
In the process of testing my custom node, I encountered an error in 
setrefs.c caused by optimization of the projection operation. In order 
to reliably understand how to properly use custom_scan_tlist, I had to 
study in detail the mechanics of the FDW plan generator and now the 
problem is solved.
We have only three references to this parameter in the hackers mailing 
list, a brief reference on postgresql.org, and limited comments in two
patches: 1a8a4e5 and e7cb7ee.
It is possible that custom_scan_tlist is designed too nontrivially, and 
it is possible that it needs some comments describing in more detail how 
to use it.


--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company




Re: clean up docs for v12

2019-04-22 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-22 12:33:24 -0400, Tom Lane wrote:
>> But TBH, now that I look at the code, I think the entire optimization
>> is a bad idea and should be removed.  Am I right in thinking that the
>> presence of a wrong attnotnull marker could cause the generated code to
>> actually crash, thanks to not checking the tuple's natts field?  I don't
>> have enough faith in our enforcement of those constraints to want to see
>> JIT taking that risk to save a nanosecond or two.

> It's not a minor optimization, it's very measurable.

Doesn't matter, if it's unsafe.

>> (Possibly I'd not think this if I weren't fresh off a couple of days
>> with my nose in the ALTER TABLE SET NOT NULL code.  But right now,
>> I think that believing that that code does not and never will have
>> any bugs is just damfool.)

> But there's plenty places where we rely on NOT NULL actually working?

I do not think there are any other places where we make this particular
assumption.  Given the number of ways in which we rely on there being
natts checks to avoid rewriting tables, I'm very afraid of the idea
that JIT is making more assumptions than the mainline code does.

In hopes of putting some fear into you too, I exhibit the following
behavior, which is not a bug according to our current definitions:

regression=# create table pp(f1 int);
CREATE TABLE
regression=# create table cc() inherits (pp);
CREATE TABLE
regression=# insert into cc values(1);
INSERT 0 1
regression=# insert into cc values(2);
INSERT 0 1
regression=# insert into cc values(null);
INSERT 0 1
regression=# alter table pp add column f2 text;
ALTER TABLE
regression=# alter table pp add column f3 text;
ALTER TABLE
regression=# alter table only pp alter f3 set not null;
ALTER TABLE
regression=# select * from pp;
 f1 | f2 | f3 
----+----+----
  1 |    | 
  2 |    | 
    |    | 
(3 rows)

The tuples coming out of cc will still have natts = 1, I believe.
If they were deformed according to pp's tupdesc, there'd be a
problem.  Now, we shouldn't do that, because this is not the only
possible discrepancy between parent and child tupdescs --- but
I think this example shows that attnotnull is a lot spongier
than you are assuming, even without considering the possibility
of outright bugs.
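
For reference, a hedged sketch of the guard the mainline deforming code relies
on (slot_getsomeattrs() and friends do the equivalent internally); tuple and
tupdesc stand for the tuple being deformed and the descriptor used to deform it:

    /*
     * Clamp to what the tuple physically contains, so an old cc tuple with
     * natts = 1 never has f2/f3 read from bytes that aren't there.
     */
    int natts = Min(HeapTupleHeaderGetNatts(tuple->t_data), tupdesc->natts);

    /*
     * Attributes at positions >= natts are then filled from defaults/NULLs
     * (slot_getmissingattrs()), not from the on-disk tuple.
     */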

regards, tom lane




Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation

2019-04-22 Thread Peter Geoghegan
On Mon, Apr 22, 2019 at 8:36 AM Stephen Frost  wrote:
> This seems like it would be helpful for global indexes as well, wouldn't
> it?

Yes, though that should probably work by reusing what we already do
with heap TID (use standard IndexTuple fields on the leaf level for
heap TID), plus an additional identifier for the partition number that
is located at the physical end of the tuple. IOW, I think that this
might benefit from a design that is half way between what we already
do with heap TIDs and what we would be required to do to make varwidth
logical row identifiers in tables work -- the partition number is
varwidth, though often only a single byte.

> I agree with trying to avoid having padding 'in the wrong place' and if
> it makes some indexes smaller, great, even if they're unlikely to be
> interesting in the vast majority of cases, they may still exist out
> there.  Of course, this is provided that it doesn't overly complicate
> the code, but it sounds like it wouldn't be too bad in this case.

Here is what it took:

* Removed the "conservative" MAXALIGN() within index_form_tuple(),
bringing it in line with heap_form_tuple(), which only MAXALIGN()s so
that the first attribute in tuple's data area can safely be accessed
on alignment-picky platforms, but doesn't do the same with data_len (see
the sketch after this list).

* Removed most of the MAXALIGN()s from nbtinsert.c, except one that
considers if a page split is required.

* Didn't change the nbtsplitloc.c code, because we need to assume
MAXALIGN()'d space quantities there. We continue to not trust the
reported tuple length to be MAXALIGN()'d, which is now essentially
rather than just defensive.

* Removed MAXALIGN()s within _bt_truncate(), and SHORTALIGN()'d the
whole tuple size in the case where the new pivot tuple requires a heap TID
representation. We access TIDs as three 2-byte integers, so this is
necessary for alignment-picky platforms.
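
A minimal sketch of the first change above, assuming the usual access/itup.h
macros are in scope (illustrative only, not the committed code):

static Size
index_tuple_size_sketch(Size data_size, unsigned short t_info)
{
    Size hoff = IndexInfoFindDataOffset(t_info);    /* data start stays MAXALIGN()'d */

    /* previously: return MAXALIGN(hoff + data_size); */
    return hoff + data_size;
}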

I will pursue this as a project for PostgreSQL 13. It doesn't affect
on-disk compatibility, because BTreeTupleGetHeapTID() works just as
well with either the existing scheme, or this new one. Having the
"real" tuple length available will make it easier to implement "true"
suffix truncation, where we truncate *within* a text attribute (i.e.
generate a new, shorter value using new opclass infrastructure).

-- 
Peter Geoghegan




Re: finding changed blocks using WAL scanning

2019-04-22 Thread Bruce Momjian
On Mon, Apr 22, 2019 at 01:11:22PM -0400, Robert Haas wrote:
> On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian  wrote:
> > I assumed the modblock files would be stored in the WAL archive so some
> > external tools could generate incremental backups using just the WAL
> > files.  I assumed they would also be sent to standby servers so
> > incremental backups could be done on standby servers too.
> 
> Yeah, that's another possible approach.  I am not sure what is best.

I am thinking you need to allow any of these, and putting the WAL files
in pg_wal and having them streamed and archived gives that flexibility.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: finding changed blocks using WAL scanning

2019-04-22 Thread Robert Haas
On Mon, Apr 22, 2019 at 12:35 PM Bruce Momjian  wrote:
> I assumed the modblock files would be stored in the WAL archive so some
> external tools could generate incremental backups using just the WAL
> files.  I assumed they would also be sent to standby servers so
> incremental backups could be done on standby servers too.

Yeah, that's another possible approach.  I am not sure what is best.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: block-level incremental backup

2019-04-22 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Sat, Apr 20, 2019 at 4:32 PM Stephen Frost  wrote:
> > Having been around for a while working on backup-related things, if I
> > was to implement the protocol for pg_basebackup today, I'd definitely
> > implement "give me a list" and "give me this file" rather than the
> > tar-based approach, because I've learned that people want to be
> > able to do parallel backups and that's a decent way to do that.  I
> > wouldn't set out and implement something new that there's just no hope
> > of making parallel.  Maybe the first write of pg_basebackup would still
> > be simple and serial since it's certainly more work to make a frontend
> > tool like that work in parallel, but at least the protocol would be
> > ready to support a parallel option being added later without being
> > rewritten.
> >
> > And that's really what I was trying to get at here- if we've got the
> > choice now to decide what this is going to look like from a protocol
> > level, it'd be great if we could make it able to support being used in a
> > parallel fashion, even if pg_basebackup is still single-threaded.
> 
> I think we're getting closer to a meeting of the minds here, but I
> don't think it's intrinsically necessary to rewrite the whole method
> of operation of pg_basebackup to implement incremental backup in a
> sensible way.  

It wasn't my intent to imply that the whole method of operation of
pg_basebackup would have to change for this.

> One could instead just do a straightforward extension
> to the existing BASE_BACKUP command to enable incremental backup.

Ok, how do you envision that?  As I mentioned up-thread, I am concerned
that we're talking too high-level here and it's making the discussion
more difficult than it would be if we were to put together specific
ideas and then discuss them.

One way I can imagine to extend BASE_BACKUP is by adding LSN as an
optional parameter and then having the database server scan the entire
cluster and send a tarball which contains essentially a 'diff' file of
some kind for each file where we can construct a diff based on the LSN,
and then the complete contents of the file for everything else that
needs to be in the backup.

So, sure, that would work, but it wouldn't be able to be parallelized
and I don't think it'd end up being very exciting for the external tools
because of that, but it would be fine for pg_basebackup.

On the other hand, if you added new commands for 'list of files changed
since this LSN' and 'give me this file' and 'give me this file with the
changes in it since this LSN', then pg_basebackup could work with that
pretty easily in a single-threaded model (maybe with two connections to
the backend, but still in a single process, or maybe just by slurping up
the file list and then asking for each one) and the external tools could
leverage those new capabilities too for their backups, both full backups
and incremental ones.  This also wouldn't have to change how
pg_basebackup does full backups today one bit, so what we're really
talking about here is the direction to take the new code that's being
written, not about rewriting existing code.  I agree that it'd be a bit
more work...  but hopefully not *that* much more, and it would mean we
could later add parallel backup to pg_basebackup more easily too, if we
wanted to.

> Then, to enable parallel full backup and all sorts of out-of-core
> hacking, one could expand the command language to allow tools to
> access individual steps: START_BACKUP, SEND_FILE_LIST,
> SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
> for an appealing project, but I do not think there is a technical
> reason why it has to be done first.  Or for that matter why it has to
> be done second.  As I keep saying, incremental backup and full backup
> are separate projects and I believe it's completely reasonable for
> whoever is doing the work to decide on the order in which they would
> like to do the work.

I didn't mean to imply that one had to be done before the other from a
technical standpoint.  I agree that they don't depend on each other.

You're certainly welcome to do what you would like, I simply wanted to
share my experiences and try to help move this in a direction that would
involve less code rewrite in the future and to have a feature that would
be more appealing to the external tools.

> Having said that, I'm curious what people other than Stephen (and
> other pgbackrest hackers) 

While David and I do talk, we haven't really discussed this proposal all
that much, so please don't assume that he shares my thoughts here.  I'd
also like to hear what others think, particularly those who have been
working in this area.

> think about the relative value of parallel
> backup vs. incremental backup.  Stephen appears quite convinced that
> parallel backup is full of win and incremental backup is a bit of a
> yawn by comparison, and while I certainly would not want to di

Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 18:49:44 +0530, Amit Kapila wrote:
> Attached is a hacky and work-in-progress patch to move fsm map to
> relcache.  This will give you some idea.  I think we need to see if
> there is a need to invalidate the relcache due to this patch.  I think
> some other comments of Andres also need to be addressed, see if you
> can attempt to fix some of them.



>  /*
> @@ -1132,9 +1110,6 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks)
>   BlockNumber blkno,
>   cached_target_block;
>  
> - /* The local map must not be set already. */
> - Assert(!FSM_LOCAL_MAP_EXISTS);
> -
>   /*
>* Starting at the current last block in the relation and working
>* backwards, mark alternating blocks as available.
> @@ -1142,7 +1117,7 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks)
>   blkno = cur_nblocks - 1;
>   while (true)
>   {
> - fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
> + rel->fsm_local_map->map[blkno] = FSM_LOCAL_AVAIL;
>   if (blkno >= 2)
>   blkno -= 2;
>   else
> @@ -1150,13 +1125,13 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks)
>   }
>  
>   /* Cache the number of blocks. */
> - fsm_local_map.nblocks = cur_nblocks;
> + rel->fsm_local_map->nblocks = cur_nblocks;
>  
>   /* Set the status of the cached target block to 'unavailable'. */
>   cached_target_block = RelationGetTargetBlock(rel);
>   if (cached_target_block != InvalidBlockNumber &&
>   cached_target_block < cur_nblocks)
> - fsm_local_map.map[cached_target_block] = FSM_LOCAL_NOT_AVAIL;
> + rel->fsm_local_map->map[cached_target_block] = FSM_LOCAL_NOT_AVAIL;
>  }

I think there shouldn't be any need for this anymore. After this change
we shouldn't invalidate the map until there's no space on it - thereby
addressing the cost overhead, and greatly reducing the likelihood that
the local FSM can lead to increased bloat.


>  /*
> @@ -1168,18 +1143,18 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks)
>   * This function is used when there is no FSM.
>   */
>  static BlockNumber
> -fsm_local_search(void)
> +fsm_local_search(Relation rel)
>  {
>   BlockNumber target_block;
>  
>   /* Local map must be set by now. */
> - Assert(FSM_LOCAL_MAP_EXISTS);
> + Assert(FSM_LOCAL_MAP_EXISTS(rel));
>  
> - target_block = fsm_local_map.nblocks;
> + target_block = rel->fsm_local_map->nblocks;
>   do
>   {
>   target_block--;
> - if (fsm_local_map.map[target_block] == FSM_LOCAL_AVAIL)
> + if (rel->fsm_local_map->map[target_block] == FSM_LOCAL_AVAIL)
>   return target_block;
>   } while (target_block > 0);
>  
> @@ -1189,7 +1164,22 @@ fsm_local_search(void)
>* first, which would otherwise lead to the same conclusion again and
>* again.
>*/
> - FSMClearLocalMap();
> + fsm_clear_local_map(rel);

I'm not sure I like this. My inclination would be that we should be able
to check the local fsm repeatedly even if there's no space in the
in-memory representation - otherwise the use of the local FSM increases
bloat.

Greetings,

Andres Freund




Trouble with FETCH_COUNT and combined queries in psql

2019-04-22 Thread Daniel Verite
  Hi,

When FETCH_COUNT is set, queries combined in a single request don't work
as expected:

 \set FETCH_COUNT 10
 select pg_sleep(2) \; select 1;

No result is displayed, the pg_sleep(2) is not run, and no error
is shown. That's disconcerting.

The sequence that is sent under the hood is:

#1 BEGIN
#2  DECLARE _psql_cursor NO SCROLL CURSOR FOR
select pg_sleep(2) ; select 1;
#3 CLOSE _psql_cursor
#4 ROLLBACK

The root problem is in deciding that a statement can be run
through a cursor if the query text starts with "select" or "values"
(in is_select_command() in common.c), but not knowing about multiple
queries in the buffer, which are not compatible with the cursor thing.

When sending #2, psql expects the PQexec("DECLARE...") to yield a
PGRES_COMMAND_OK, but it gets a PGRES_TUPLES_OK instead. Given
that, it abandons the cursor, rollbacks the transaction (if
it opened it), and clears out the results of the second select
without displaying them.

If there was already a transaction open, the problem is worse because
it doesn't rollback and we're silently missing an SQL statement that
was possibly meant to change the state of the data, as in
 BEGIN; SELECT compute_something() \; select get_results(); END;

Does anyone have thoughts about how to fix this?
ATM I don't see a plausible fix that does not involve having the parser
store the information that it's a multiple-query command and pass
it down somehow to is_select_command().
Or a more modern approach could be to give up on the
cursor-based method in favor of PQsetSingleRowMode().
That might be too big a change for a bug fix, though.
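
For reference, a minimal libpq sketch of the single-row-mode approach, assuming
an already-open connection and simplified display/error handling:

#include <stdio.h>
#include <libpq-fe.h>

static void
fetch_in_row_mode(PGconn *conn, const char *query)
{
    PGresult *res;

    if (!PQsendQuery(conn, query))
    {
        fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
        return;
    }
    if (!PQsetSingleRowMode(conn))      /* must precede the first PQgetResult() */
        fprintf(stderr, "could not enter single-row mode\n");

    while ((res = PQgetResult(conn)) != NULL)
    {
        switch (PQresultStatus(res))
        {
            case PGRES_SINGLE_TUPLE:
                printf("%s\n", PQgetvalue(res, 0, 0));  /* one row at a time */
                break;
            case PGRES_TUPLES_OK:
            case PGRES_COMMAND_OK:
                break;                  /* end of one statement's results */
            default:
                fprintf(stderr, "error: %s", PQresultErrorMessage(res));
                break;
        }
        PQclear(res);
    }
}

Each statement in a combined query string then yields its own result sequence,
with no cursor wrapping and no guessing whether the buffer starts with SELECT.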


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Optimizer items in the release notes

2019-04-22 Thread Bruce Momjian
We had a discussion in October about adding more optimizer items to the
release notes:


https://www.postgresql.org/message-id/flat/20181010220601.GA7807%40momjian.us#11d805ea0b0fcd0552dfa99251417cc1

There was no agreement on a change, but if people want to propose a
change, please post here and we can discuss it.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: clean up docs for v12

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 09:43:56 -0700, Andres Freund wrote:
> On 2019-04-22 12:33:24 -0400, Tom Lane wrote:
> > ISTM that Michael's proposed wording change shows that the existing
> > comment is easily misinterpreted.  I don't think these minor grammatical
> > fixes will avoid the misinterpretation problem, and so some more-extensive
> > rewording is called for.
> 
> Fair enough.

The computation of that variable above has:

 * If the column is possibly missing, we can't rely on its (or
 * subsequent) NOT NULL constraints to indicate minimum attributes in
 * the tuple, so stop here.
 */
if (att->atthasmissing)
    break;

/*
 * Column is NOT NULL and there've been no preceding missing columns,
 * it's guaranteed that all columns up to here exist at least in the
 * NULL bitmap.
 */
if (att->attnotnull)
    guaranteed_column_number = attnum;

and only then the comment referenced in the discussion here follows:
/*
 * Check if's guaranteed the all the desired attributes are available in
 * tuple. If so, we can start deforming. If not, need to make sure to
 * fetch the missing columns.
 */


I think just reformulating that to something like

/*
 * Check if it's guaranteed that all the desired attributes are available
 * in the tuple (but still possibly NULL), by dint of either the last
 * to-be-deformed column being NOT NULL, or subsequent ones not accessed
 * here being NOT NULL.  If that's not guaranteed, the tuple header's natts
 * has to be checked, and missing attributes potentially have to be
 * fetched (using slot_getmissingattrs()).
 */

should make that clearer?

Greetings,

Andres Freund




Re: clean up docs for v12

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-22 12:33:24 -0400, Tom Lane wrote:
> ISTM that Michael's proposed wording change shows that the existing
> comment is easily misinterpreted.  I don't think these minor grammatical
> fixes will avoid the misinterpretation problem, and so some more-extensive
> rewording is called for.

Fair enough.


> But TBH, now that I look at the code, I think the entire optimization
> is a bad idea and should be removed.  Am I right in thinking that the
> presence of a wrong attnotnull marker could cause the generated code to
> actually crash, thanks to not checking the tuple's natts field?  I don't
> have enough faith in our enforcement of those constraints to want to see
> JIT taking that risk to save a nanosecond or two.

It's not a minor optimization, it's very measurable. Without the check
there's no pipeline stall when the memory for the tuple header is not in
the CPU cache (very common, especially for seqscans and such, due to the
"backward" memory location ordering of tuples in seqscans, which CPUs
don't predict). Server grade CPUs of the last ~5 years just march on and
start the work to fetch the first attributes (especially if they're NOT
NULL) - but can't do that if natts has to be checked. And starting to
check the NULL bitmap for NOT NULL attributes would make that even
worse - and that would be required if we don't trust attnotnull.


> (Possibly I'd not think this if I weren't fresh off a couple of days
> with my nose in the ALTER TABLE SET NOT NULL code.  But right now,
> I think that believing that that code does not and never will have
> any bugs is just damfool.)

But there's plenty places where we rely on NOT NULL actually working?
We'll return wrong query results, and even crash in non-JIT places
because we thought there was guaranteed to be a datum?

Greetings,

Andres Freund




Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation

2019-04-22 Thread Andres Freund
Hi,

On 2019-04-21 17:46:09 -0700, Peter Geoghegan wrote:
> Andres has suggested that I work on teaching nbtree to accommodate
> variable-width, logical table identifiers, such as those required for
> indirect indexes, or clustered indexes, where secondary indexes must
> use a logical primary key value instead of a heap TID.

I think it's two more cases:

- table AMs that want to support tables that are bigger than 32TB. That
  used to be unrealistic, but it's not anymore. Especially when the need
  to VACUUM etc is largely removed / reduced.
- global indexes (for cross-partition unique constraints and such),
  which need a partition identifier as part of the tid (or as part of
  the index key, but I think that actually makes interaction with
  indexam from other layers more complicated - the inside of the index
  may want to represent it as a column, but to the outside that
  ought not to be visible)



> Thoughts?

Seems reasonable to me.

I, more generally, wonder if there's not a case to squeeze out more
padding than "just" what you describe (since we IIRC don't frequently
keep pointers into such tuples anyway, and definitely don't for byval
attrs). But that's very likely better done separately.

Greetings,

Andres Freund




Re: finding changed blocks using WAL scanning

2019-04-22 Thread Bruce Momjian
On Mon, Apr 22, 2019 at 12:15:32PM -0400, Robert Haas wrote:
> On Mon, Apr 22, 2019 at 11:48 AM Bruce Momjian  wrote:
> > My point is that you would normally only remove the modblock file when 4
> > is removed because this modblock file is useful for incremental backups
> > from base backups that happened between 1 and 4.
> 
> That's an interesting point.  On the other hand, I think it would be
> typical to want the master to retain .modblock files for much longer
> than it retains WAL segments, and in my design, the WAL archive
> wouldn't see those files at all; they'd be stored on the master.  I
> was actually thinking that they should possibly be stored in a
> separate directory to avoid confusion.

I assumed the modblock files would be stored in the WAL archive so some
external tools could generate incremental backups using just the WAL
files.  I assumed they would also be sent to standby servers so
incremental backups could be done on standby servers too.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: clean up docs for v12

2019-04-22 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Apr-22, Andres Freund wrote:
>> On 2019-04-22 14:48:26 +0900, Michael Paquier wrote:
>>> /*
>>> -* Check if's guaranteed the all the desired attributes are available in
>>> -* tuple. If so, we can start deforming. If not, need to make sure to
>>> -* fetch the missing columns.
>>> +* Check if all the desired attributes are available in the tuple.  If so,
>>> +* we can start deforming.  If not, we need to make sure to fetch the
>>> +* missing columns.
>>> */

>> That's imo not an improvement. The guaranteed bit is actually
>> relevant. What this block is doing is eliding the check against the
>> tuple header for the number of attributes, if NOT NULL attributes for
>> later columns guarantee that the desired columns are present in the NULL
>> bitmap. But the rephrasing makes it sound like we're actually checking
>> against the tuple.
>> 
>> I think it'd be better just to fix s/the all/that all/.

> (and s/if's/if it's/)

ISTM that Michael's proposed wording change shows that the existing
comment is easily misinterpreted.  I don't think these minor grammatical
fixes will avoid the misinterpretation problem, and so some more-extensive
rewording is called for.

But TBH, now that I look at the code, I think the entire optimization
is a bad idea and should be removed.  Am I right in thinking that the
presence of a wrong attnotnull marker could cause the generated code to
actually crash, thanks to not checking the tuple's natts field?  I don't
have enough faith in our enforcement of those constraints to want to see
JIT taking that risk to save a nanosecond or two.

(Possibly I'd not think this if I weren't fresh off a couple of days
with my nose in the ALTER TABLE SET NOT NULL code.  But right now,
I think that believing that that code does not and never will have
any bugs is just damfool.)

regards, tom lane




Re: clean up docs for v12

2019-04-22 Thread Alvaro Herrera
On 2019-Apr-22, Andres Freund wrote:

> On 2019-04-22 14:48:26 +0900, Michael Paquier wrote:

> > /*
> > -* Check if's guaranteed the all the desired attributes are available in
> > -* tuple. If so, we can start deforming. If not, need to make sure to
> > -* fetch the missing columns.
> > +* Check if all the desired attributes are available in the tuple.  If so,
> > +* we can start deforming.  If not, we need to make sure to fetch the
> > +* missing columns.
> >  */
> 
> That's imo not an improvement. The guaranteed bit is actually
> relevant. What this block is doing is eliding the check against the
> tuple header for the number of attributes, if NOT NULL attributes for
> later columns guarantee that the desired columns are present in the NULL
> bitmap. But the rephrasing makes it sound like we're actually checking
> against the tuple.
> 
> I think it'd be better just to fix s/the all/that all/.

(and s/if's/if it's/)

> 
> > if ((natts - 1) <= guaranteed_column_number)
> > {
> > @@ -383,7 +383,7 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc desc,
> >  
> > /*
> >  * If this is the first attribute, slot->tts_nvalid was 0. Therefore
> > -* reset offset to 0 to, it be from a previous execution.
> > +* reset offset to 0 too, as it may be from a previous execution.
> >  */
> > if (attnum == 0)
> > {
> 
> That obviously makes sense.

Hmm, I think "as it *is*", not "as it *may be*", right?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: finding changed blocks using WAL scanning

2019-04-22 Thread Robert Haas
On Mon, Apr 22, 2019 at 11:48 AM Bruce Momjian  wrote:
> My point is that most WAL archive tools will order and remove files
> based on their lexical ordering, so if you put the start first, the file
> will normally be removed when it should be kept, e.g., if you have WAL
> files like:
>
> 00010001
> 00010002
> 00010003
> 00010004
> 00010005
>
> putting the start first and archiving some wal would lead to:
>
> 00010001-00010004.modblock
> 00010003
> 00010004
> 00010005
>
> We removed 1 and 2, but kept the modblock file, which looks out of
> order. Having the end at the start would have:
>
> 00010003
> 00010004
> 00010004-00010001.modblock
> 00010005
>
> My point is that you would normally only remove the modblock file when 4
> is removed because this modblock file is useful for incremental backups
> from base backups that happened between 1 and 4.

That's an interesting point.  On the other hand, I think it would be
typical to want the master to retain .modblock files for much longer
than it retains WAL segments, and in my design, the WAL archive
wouldn't see those files at all; they'd be stored on the master.  I
was actually thinking that they should possibly be stored in a
separate directory to avoid confusion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



