Re: [Patch] Check file type before calling AllocateFile() for files out of pg data directory to avoid potential issues (e.g. hang).

2019-04-23 Thread Paul Guo
On Wed, Apr 24, 2019 at 12:49 PM Andres Freund  wrote:

> Hi,
>
> On 2019-04-24 12:46:15 +0800, Paul Guo wrote:
> > This is, in theory, not a 100% bug, but it is probably not unusual to see
> > conflicts of files between PostgreSQL SQL commands and other applications
> > on the same node, so I think the fix is needed. I checked all code that
> > calls AllocateFile() and wrote a simple patch to do a sanity check (if the
> > file exists it must be a regular file) for those files which are probably
> > outside the postgres data directories (files inside the data directories
> > can probably be ignored). This is actually not a perfect fix since it is
> > not atomic (check and open), but it should cover most of the scenarios.
> > To be perfect, we might want to refactor AllocateFile() to allow an atomic
> > check using either 'x' in fopen() or O_EXCL in open(); it also seems that
> > we might not want to create temp files for AllocateFile() with fixed
> > filenames. That is beyond the scope of this patch, of course.
>
> This seems like a bad idea to me. IMO we want to support using a pipe
> etc here. If the admin creates a fifo like this without attaching a
> program it seems like it's their fault.
>

Oh, I never knew about this application scenario before. So yes, for this we
need to keep the current code logic in the copy code.


>
> Greetings,
>
> Andres Freund
>


Re: [Patch] Check file type before calling AllocateFile() for files out of pg data directory to avoid potential issues (e.g. hang).

2019-04-23 Thread Andres Freund
Hi,

On 2019-04-24 12:46:15 +0800, Paul Guo wrote:
> This is, in theory, not a 100% bug, but it is probably not unusual to see
> conflicts of files between PostgreSQL SQL commands and other applications
> on the same node, so I think the fix is needed. I checked all code that
> calls AllocateFile() and wrote a simple patch to do a sanity check (if the
> file exists it must be a regular file) for those files which are probably
> outside the postgres data directories (files inside the data directories
> can probably be ignored). This is actually not a perfect fix since it is
> not atomic (check and open), but it should cover most of the scenarios.
> To be perfect, we might want to refactor AllocateFile() to allow an atomic
> check using either 'x' in fopen() or O_EXCL in open(); it also seems that
> we might not want to create temp files for AllocateFile() with fixed
> filenames. That is beyond the scope of this patch, of course.

This seems like a bad idea to me. IMO we want to support using a pipe
etc here. If the admin creates a fifo like this without attaching a
program it seems like it's their fault.

Greetings,

Andres Freund




[Patch] Check file type before calling AllocateFile() for files out of pg data directory to avoid potential issues (e.g. hang).

2019-04-23 Thread Paul Guo
Hello, Postgres hackers.

I happened to see a hang issue when running a simple copy query. The root
cause and the way to reproduce it are quite simple.

mkfifo /tmp/a

run sql:
copy (select generate_series(1, 10)) to '/tmp/a';

It hangs at AllocateFile()->fopen() because that file was previously
created as a fifo file, and it is not ctrl+c cancellable (on Linux).

#0  0x7f52df1c45a0 in __open_nocancel () at
../sysdeps/unix/syscall-template.S:81
#1  0x7f52df151f20 in _IO_file_open (is32not64=4, read_write=4,
prot=438, posix_mode=, filename=0x7ffe64199a10
"\360\303[\001", fp=0x1649c40) at fileops.c:229
#2  _IO_new_file_fopen (fp=fp@entry=0x1649c40,
filename=filename@entry=0x15bc3f0
"/tmp/a", mode=, mode@entry=0xaa0bb7 "w",
is32not64=is32not64@entry=1) at fileops.c:339
#3  0x7f52df1465e4 in __fopen_internal (filename=0x15bc3f0 "/tmp/a",
mode=0xaa0bb7 "w", is32=1) at iofopen.c:90
#4  0x007a0e90 in AllocateFile (name=0x15bc3f0 "/tmp/a",
mode=mode@entry=0xaa0bb7 "w") at fd.c:2229
#5  0x005c51b4 in BeginCopyTo (pstate=pstate@entry=0x15b95f0,
rel=rel@entry=0x0, query=, queryRelId=queryRelId@entry=0,
filename=, is_program=, attnamelist=0x0,
options=0x0) at copy.c:1919
#6  0x005c8999 in DoCopy (pstate=pstate@entry=0x15b95f0,
stmt=stmt@entry=0x1596b60, stmt_location=0, stmt_len=48,
processed=processed@entry=0x7ffe64199cd8) at copy.c:1078
#7  0x007d717a in standard_ProcessUtility (pstmt=0x1596ea0,
queryString=0x1595dc0 "copy (select generate_series(1, 10)) to '/tmp/a';",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x1596f80,
completionTag=0x7ffe64199f90 "") at utility.c:551

This is, in theory, not a 100% bug, but it is probably not unusual to see
conflicts of files between PostgreSQL SQL commands and other applications on
the same node, so I think the fix is needed. I checked all code that calls
AllocateFile() and wrote a simple patch to do a sanity check (if the file
exists it must be a regular file) for those files which are probably outside
the postgres data directories (files inside the data directories can probably
be ignored). This is actually not a perfect fix since it is not atomic (check
and open), but it should cover most of the scenarios. To be perfect, we might
want to refactor AllocateFile() to allow an atomic check using either 'x' in
fopen() or O_EXCL in open(); it also seems that we might not want to create
temp files for AllocateFile() with fixed filenames. That is beyond the scope
of this patch, of course.
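
For illustration only, this is the kind of pre-check the patch adds (a
sketch, not the patch itself; the helper name is made up, and it still has
the check-then-open race mentioned above):

#include <sys/stat.h>
#include <stdbool.h>
#include <errno.h>

/*
 * Return true if the path either does not exist yet or refers to a regular
 * file.  FIFOs, directories, devices and so on are rejected, so that a
 * later fopen()/AllocateFile() cannot block on them.
 */
static bool
path_is_regular_or_missing(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return errno == ENOENT;    /* not there yet: fine, it will be created */

    return S_ISREG(st.st_mode);
}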

Thanks.


0001-Make-sure-the-file-is-a-regular-file-if-it-exists-be.patch
Description: Binary data


Re: Regression test PANICs with master-standby setup on same machine

2019-04-23 Thread Kyotaro HORIGUCHI
Sorry, I was in haste.

At Wed, 24 Apr 2019 13:18:45 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190424.131845.116224815.horiguchi.kyot...@lab.ntt.co.jp>
> At Tue, 23 Apr 2019 10:05:03 -0700, Andres Freund  wrote 
> in <20190423170503.uw5jxrujqlozg...@alap3.anarazel.de>
> > Hi,
> > 
> > On 2019-04-23 16:08:18 +0900, Michael Paquier wrote:
> > > On Mon, Apr 22, 2019 at 11:00:03PM -0700, Andres Freund wrote:
> > > > FWIW, I think the right fix for this is to simply drop the requirement
> > > > that tablespace paths need to be absolute. It's not buying us anything,
> > > > it's just making things more complicated. We should just do a simple
> > > > check against the tablespace being inside PGDATA, and leave it at
> > > > that. Yes, that can be tricked, but so can the current system.
> > > 
> > > convert_and_check_filename() checks after that already, mostly.  For
> > > TAP tests I am not sure that this would help much though as all the
> > > nodes of a given test use the same root path for their data folders,
> > > so you cannot just use "../hoge/" as location.
> > 
> > I don't see the problem here.  Putting the primary and standby PGDATAs
> > into a subdirectory that also can contain a relatively referenced
> > tablespace seems trivial?
> > 
> > > I'm not We already generate a warning when a tablespace is in a data
> > > folder, as this causes issues with recursion lookups of base backups.
> > > What do you mean in this case?  Forbidding the behavior?  -- Michael
> > 
> > I mostly am talking about replacing
> > 
> > Oid
> > CreateTableSpace(CreateTableSpaceStmt *stmt)
> > {
> > ...
> > /*
> >  * Allowing relative paths seems risky
> >  *
> >  * this also helps us ensure that location is not empty or whitespace
> >  */
> > if (!is_absolute_path(location))
> > ereport(ERROR,
> > (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
> >  errmsg("tablespace location must be an 
> > absolute path")));
> > 
> > with a check that forces relative paths to be outside of PGDATA (barring
> > symlinks). As far as I can tell convert_and_check_filename() would check
> > just about the opposite.

We need to adjust relative paths between the PGDATA-based and
pg_tblspc-based forms. The attached first patch does that.

- I'm not sure it is OK to use getcwd this way. Another issue
  here is that is_in_data_directory canonicalizes DataDir
  on the fly. That is needed when DataDir contains '/./' or such. I
  think the canonicalization should be done far earlier.

The second attached patch is a TAP change to support tablespaces
using relative paths.

- This is tentative, or a sample. I'll visit the current discussion thread.

The third is a test for this issue.

- Tablespace handling gets easier.

The fourth is the fix for the issue here.

- Not all possible similar issues are checked.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Regression test PANICs with master-standby setup on same machine

2019-04-23 Thread Kyotaro HORIGUCHI
At Tue, 23 Apr 2019 10:05:03 -0700, Andres Freund  wrote in 
<20190423170503.uw5jxrujqlozg...@alap3.anarazel.de>
> Hi,
> 
> On 2019-04-23 16:08:18 +0900, Michael Paquier wrote:
> > On Mon, Apr 22, 2019 at 11:00:03PM -0700, Andres Freund wrote:
> > > FWIW, I think the right fix for this is to simply drop the requirement
> > > that tablespace paths need to be absolute. It's not buying us anything,
> > > it's just making things more complicated. We should just do a simple
> > > check against the tablespace being inside PGDATA, and leave it at
> > > that. Yes, that can be tricked, but so can the current system.
> > 
> > convert_and_check_filename() checks after that already, mostly.  For
> > TAP tests I am not sure that this would help much though as all the
> > nodes of a given test use the same root path for their data folders,
> > so you cannot just use "../hoge/" as location.
> 
> I don't see the problem here.  Putting the primary and standby PGDATAs
> into a subdirectory that also can contain a relatively referenced
> tablespace seems trivial?
> 
> > I'm not We already generate a warning when a tablespace is in a data
> > folder, as this causes issues with recursion lookups of base backups.
> > What do you mean in this case?  Forbidding the behavior?  -- Michael
> 
> I mostly am talking about replacing
> 
> Oid
> CreateTableSpace(CreateTableSpaceStmt *stmt)
> {
> ...
>   /*
>* Allowing relative paths seems risky
>*
>* this also helps us ensure that location is not empty or whitespace
>*/
>   if (!is_absolute_path(location))
>   ereport(ERROR,
>   (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
>errmsg("tablespace location must be an 
> absolute path")));
> 
> with a check that forces relative paths to be outside of PGDATA (barring
> symlinks). As far as I can tell convert_and_check_filename() would check
> just about the opposite.

We need to adjust relative paths between the PGDATA-based and
pg_tblspc-based forms. The attached first patch does that.

- I'm not sure it is OK to use getcwd this way.

The second attached patch is a TAP change to support tablespaces
using relative paths. One issue here is that is_in_data_directory
canonicalizes DataDir on the fly. That is needed when DataDir
contains '/./' or such. I think the canonicalization should be
done far earlier.

- This is tentative, or a sample. I'll visit the current discussion thread.

The third is a test for this issue.

- Tablespace handling gets easier.

The fourth is the fix for the issue here.

- Not all possible similar issues are checked.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 714702a6abfe1987eab98020c799b845917af156 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 24 Apr 2019 11:50:41 +0900
Subject: [PATCH 1/4] Allow relative tablespace location paths

Currently relative paths are not allowed as tablespace directories. That
makes tests involving tablespaces bothersome, yet doesn't perfectly prevent
them from pointing into the data directory. This patch allows a relative
directory, based on the data directory, as a tablespace directory, and then
makes the check more strict.
---
 src/backend/access/transam/xlog.c |  8 
 src/backend/commands/tablespace.c | 76 ---
 src/backend/replication/basebackup.c  |  7 +++
 src/backend/utils/adt/misc.c  |  7 +++
 src/bin/pg_basebackup/pg_basebackup.c | 56 +--
 src/test/regress/input/tablespace.source  |  3 ++
 src/test/regress/output/tablespace.source |  5 +-
 7 files changed, 142 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c00b63c751..abc2fd951f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10419,6 +10419,14 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 			}
 			linkpath[rllen] = '\0';
 
+			/*
+			 * In the relative path cases, the link target is always prefixed
+			 * by "../" to convert from data directory-based. So we just do
+			 * the reverse here.
+			 */
+			if (strncmp(s, "../", 3) == 0)
+s += 3;
+
 			/*
 			 * Add the escape character '\\' before newline in a string to
 			 * ensure that we can distinguish between the newline in the
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 3784ea4b4f..66a70871e6 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -94,6 +94,42 @@ static void create_tablespace_directories(const char *location,
 			  const Oid tablespaceoid);
 static bool destroy_tablespace_directories(Oid tablespaceoid, bool redo);
 
+/*
+ * Check if the path is in the data directory strictly.
+ */
+static bool
+is_in_data_directory(const char *path)
+{
+	char cwd[MAXPGPATH];
+	char abspath[MAXPGPATH];
+	char absdatadir[MAXPGPATH];
+
+	getcwd(cwd, 

Re: pg_dump partitions can lead to inconsistent state after restore

2019-04-23 Thread Amit Langote
On 2019/04/24 10:19, David Rowley wrote:
> On Wed, 24 Apr 2019 at 06:50, Alvaro Herrera  wrote:
>> Per my comment at https://postgr.es/m/2019045129.GA6126@alvherre.pgsql
>> I think that pg_dump can possibly cause bogus partition definitions,
>> when the users explicitly decide to join tables as partitions that have
>> different column ordering than the parent table.  Any COPY or INSERT
>> command without an explicit column list that tries to put tuples in the
>> table will fail after the restore.
> 
> Yeah, pg_dump itself is broken here, never mind dreaming up some other
> user command.
> 
> We do use a column list when doing COPY, but with --inserts (not
> --column-inserts) we don't include a column list.
> 
> All it takes is:
> 
> postgres=# create table listp (a int, b text) partition by list(a);
> CREATE TABLE
> postgres=# create table listp1 (b text, a int);
> CREATE TABLE
> postgres=# alter table listp attach partition listp1 for values in(1);
> ALTER TABLE
> postgres=# insert into listp values(1,'One');
> INSERT 0 1
> postgres=# \q
> 
> $ createdb test1
> $ pg_dump --inserts postgres | psql test1
> ...
> ERROR:  invalid input syntax for type integer: "One"
> LINE 1: INSERT INTO public.listp1 VALUES ('One', 1);
> 
> That settles the debate on the other thread...

+1 to fixing this, although +0.5 to back-patching.

The reason no one has complained so far of being bitten by this may be
that, as each one of us has said at least once on the other thread, users
are not very likely to create partitions with different column orders to
begin with.  Maybe that isn't a reason to leave it as is, though.

Thanks,
Amit





RE: psql - add SHOW_ALL_RESULTS option

2019-04-23 Thread Iwata, Aya
Hi Fabien, 

I reviewed your patch. 

> Add a few tests for the new feature.
+++ b/src/test/regress/expected/psql.out
@@ -4729,3 +4729,46 @@ drop schema testpart;
 set search_path to default;
 set role to default;
 drop role testrole_partitioning;
+-- 
There is a trailing space ("+-- "). Please delete it; it causes the regression
test to fail.

> IMHO this new setting should be on by default: few people know about \; so
> it would not change anything for most, and I do not see why those who use
> it would not be interested by the results of all the queries they asked for.
I agree with your opinion.

I tested some query combination cases, and I found that when a warning happens,
the message is printed at the head of the results. I think it is not clear in
which query the warning occurred.
How about printing the warning message before the results of the query in which
the warning occurred?

For example,
-- divided by ';'
postgres=# BEGIN; BEGIN; SELECT 1 AS one; COMMIT; BEGIN; BEGIN; SELECT 1 AS 
one; COMMIT;
BEGIN
psql: WARNING:  there is already a transaction in progress
BEGIN
 one 
-
   1
(1 row)

COMMIT
BEGIN
psql: WARNING:  there is already a transaction in progress
BEGIN
 one 
-
   1
(1 row)

COMMIT


-- divided by '\;' and with SHOW_ALL_RESULTS on
postgres=# \set SHOW_ALL_RESULTS on
postgres=# BEGIN\; BEGIN\; SELECT 1 AS one\; COMMIT\; BEGIN\; BEGIN\; SELECT 1 
AS one\; COMMIT;
psql: WARNING:  there is already a transaction in progress
BEGIN
BEGIN
 one 
-
   1
(1 row)

psql: WARNING:  there is already a transaction in progress
COMMIT
BEGIN
BEGIN
 one 
-
   1
(1 row)

COMMIT

I will check the code soon.

Regards, 
Aya Iwata





RE: Patch: doc for pg_logical_emit_message()

2019-04-23 Thread Matsumura, Ryo
On Tue. Apr. 23, 2019 at 02:59 AM Masao, Fujii
 wrote:

Thank you for the comment.

> So I think that the patch should fix also the description for those
> replication functions. Thought?

I think so too.
I attach a new patch.

Regards
Ryo Matsumura


replication_functions_doc.patch
Description: replication_functions_doc.patch


Re: pg_dump is broken for partition tablespaces

2019-04-23 Thread Amit Langote
On 2019/04/23 20:03, David Rowley wrote:
> On Tue, 23 Apr 2019 at 18:18, Amit Langote
>  wrote:
>>
>> If partitions needed a
>> map in the old database, this patch means that they will *continue* to
>> need it in the new database.
> 
> That's incorrect.

Not completely though, because...

> My point was about dropped columns being removed
> after a dump / reload.  Only binary upgrade mode preserves
> pg_attribute entries for dropped columns. Normal mode does not, so the
> maps won't be needed after the reload if they were previously only
> needed due to dropped columns.  This is the case both with and without
> the pg_dump changes I proposed.  The case the patch does change is if
> the columns were actually out of order, which I saw as an unlikely
> thing to happen in the real world.

This is the case I was talking about, which I agree is very rare.  Sorry
for being unclear.

I think your proposed patch is fine and I don't want to argue that the way
things are now has some very sound basis.

Also, as you and Alvaro have found, the existing arrangement makes pg_dump
emit partitions in a way that's not super helpful (insert/copy failing
unintuitively), but it's not totally broken either.  That said, I don't
mean to oppose back-patching any fix you think is appropriate.

Thank you for working on this.

Regards,
Amit





Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

2019-04-23 Thread Amit Langote
Thanks for looking at this.

On 2019/04/24 7:03, Tom Lane wrote:
> ISTM we could work around the problem with the attached, which I think
> is a good change independently of anything else.

Agreed.

> There is still an issue, which manifests in both 11 and HEAD, namely
> that the code also propagates the parent index's comment to any child
> indexes.  You can see that with this extended test case:
> 
> create table users(user_id int, name varchar(64), unique (user_id, name)) 
> partition by hash(user_id);
> comment on index users_user_id_name_key is 'parent index';
> create table users_000 partition of users for values with (modulus 2, 
> remainder 0);   
> create table users_001 partition of users for values with (modulus 2, 
> remainder 1);   
> 
> select relname, relfilenode, obj_description(oid,'pg_class') from pg_class 
> where relname like 'users%';
> alter table users alter column name type varchar(127);
>  
> select relname, relfilenode, obj_description(oid,'pg_class') from pg_class 
> where relname like 'users%';
> 
> which gives me (in 11, with this patch)
> 
> ...
>   relname   | relfilenode | obj_description 
> +-+-
>  users  |   89389 | 
>  users_000  |   89394 | 
>  users_000_user_id_name_key |   89397 | 
>  users_001  |   89399 | 
>  users_001_user_id_name_key |   89402 | 
>  users_user_id_name_key |   89392 | parent index
> (6 rows)
> 
> ALTER TABLE
>   relname   | relfilenode | obj_description 
> +-+-
>  users  |   89389 | 
>  users_000  |   89394 | 
>  users_000_user_id_name_key |   89406 | parent index
>  users_001  |   89399 | 
>  users_001_user_id_name_key |   89408 | parent index
>  users_user_id_name_key |   89404 | parent index
> (6 rows)

This may be seen as slightly worse if the child indexes had their own
comments, which would get overwritten by the parent's.

create table pp (a int, b text, unique (a, b)) partition by list (a);
create table pp1 partition of pp for values in (1);
create table pp2 partition of pp for values in (2);
comment on index pp_a_b_key is 'parent index';
comment on index pp1_a_b_key is 'child index 1';
comment on index pp2_a_b_key is 'child index 2';
select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
where relname like 'pp%';
   relname   │ relfilenode │ obj_description
─┼─┼─
 pp  │   16420 │
 pp1 │   16425 │
 pp1_a_b_key │   16428 │ child index 1
 pp2 │   16433 │
 pp2_a_b_key │   16436 │ child index 2
 pp_a_b_key  │   16423 │ parent index
(6 rows)

alter table pp alter b type varchar(128);
select relname, relfilenode, obj_description(oid,'pg_class') from pg_class
where relname like 'pp%';
   relname   │ relfilenode │ obj_description
─┼─┼─
 pp  │   16420 │
 pp1 │   16447 │
 pp1_a_b_key │   16450 │ parent index
 pp2 │   16451 │
 pp2_a_b_key │   16454 │ parent index
 pp_a_b_key  │   16441 │ parent index
(6 rows)

> However, I doubt that that's bad enough to justify a major rewrite
> of the ALTER TABLE code in 11 ... and maybe not in HEAD either;
> I wouldn't be too unhappy to leave it to v13.

Yeah, it's probably a decent amount of code churn to undertake as a bug fix.

Thanks,
Amit





Re: pg_dump partitions can lead to inconsistent state after restore

2019-04-23 Thread David Rowley
On Wed, 24 Apr 2019 at 06:50, Alvaro Herrera  wrote:
> Per my comment at https://postgr.es/m/2019045129.GA6126@alvherre.pgsql
> I think that pg_dump can possibly cause bogus partition definitions,
> when the users explicitly decide to join tables as partitions that have
> different column ordering than the parent table.  Any COPY or INSERT
> command without an explicit column list that tries to put tuples in the
> table will fail after the restore.

Yeah, pg_dump itself is broken here, never mind dreaming up some other
user command.

We do use a column list when doing COPY, but with --inserts (not
--column-inserts) we don't include a column list.

All it takes is:

postgres=# create table listp (a int, b text) partition by list(a);
CREATE TABLE
postgres=# create table listp1 (b text, a int);
CREATE TABLE
postgres=# alter table listp attach partition listp1 for values in(1);
ALTER TABLE
postgres=# insert into listp values(1,'One');
INSERT 0 1
postgres=# \q

$ createdb test1
$ pg_dump --inserts postgres | psql test1
...
ERROR:  invalid input syntax for type integer: "One"
LINE 1: INSERT INTO public.listp1 VALUES ('One', 1);

That settles the debate on the other thread...

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: pg_dump is broken for partition tablespaces

2019-04-23 Thread Alvaro Herrera
On 2019-Apr-24, David Rowley wrote:

> On Wed, 24 Apr 2019 at 10:26, Alvaro Herrera  wrote:
> > If creating a partition, there is the additional rule that parent's
> > tablespace overrides default_tablespace:
> >
> >  a2. if there's a TABLESPACE clause, use that.
> >  b2. otherwise, if the parent has a tablespace, use that.
> >  c2. otherwise, if there's a default_tablespace, use that.
> >  d2. otherwise, use the database tablespace.
> >  e2. if we end up with the database tablespace, overwrite with 0.
> 
> Wouldn't it just take the proposed pg_dump change to get that?  rule
> e2 says we'll store 0 in reltablespace, even if the user does
> TABLESPACE pg_default, so there's no requirement to adjust the hack in
> heap_create to put any additional conditions on when we set
> reltablespace to 0, so it looks like none of the patching work you did
> would be required to implement this. Right?

I'm not sure yet that 100% of the patch is gone, but yes much of it
would go away thankfully.  I do suggest we should raise an error if rule
a3 hits and it mentions the database tablespace (I stupidly forgot
this critical point in the previous email).  I think astonishment is
lesser that way.

> The good part about that is that it's consistent with what happens if
> the user does TABLESPACE pg_default for any other object type that
> supports tablespaces, i.e. we always store 0.

Yeah, it makes the whole thing a lot simpler.  Note my note for further
development of a feature (modelled after Robert's proposal) to allow the
database tablespace to be specified, using either a separate pg_tablespace
entry that means "use the database tablespace whatever that is" (Tom's
suggestion), or a magic not-a-real-tablespace-OID number known to the
code, such as 1 (Robert's).

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-23 Thread Michael Paquier
On Tue, Apr 23, 2019 at 07:54:52PM -0400, Tom Lane wrote:
> That seems like pretty much of a hack :-(, in particular I'm not
> convinced that we'd not end up with a missing index entry afterwards.
> Maybe it's the only way, but I think first we need to trace down
> exactly why this broke.  I remember we had some special-case code
> for reindexing pg_class, maybe something broke that?

Yes, reindex_relation() has some infrastructure to enforce the list of
indexes in the cache for pg_class, which was introduced by a56a016, as far
as it goes.

> It also seems quite odd that it doesn't fail every time; surely it's
> not conditional whether we'll try to insert a new pg_class tuple or not?
> We need to understand that, too.  Maybe the old code never really
> worked in all cases?  It seems clear that the commit you bisected to
> just allowed a pre-existing misbehavior to become exposed (more easily).
> 
> No time to look closer right now.

Yeah, there is a fishy smell underneath which comes from 9.6.  When
testing with 9.5 or older a database creation does not create any
crash on a subsequent reindex.  Not sure I'll have the time to look at
that more today, perhaps tomorrow depending on the odds.
--
Michael


signature.asc
Description: PGP signature


Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-23 Thread Tom Lane
I wrote:
> It also seems quite odd that it doesn't fail every time; surely it's
> not conditional whether we'll try to insert a new pg_class tuple or not?
> We need to understand that, too.

Oh!  One gets you ten it "works" as long as the pg_class update is a
HOT update, so that we don't actually end up touching the indexes.
This explains why the crash is less likely to happen in a database
where one's done some work (and, probably, created some dead space in
pg_class).  On the other hand, it doesn't quite fit the observation
that a VACUUM FULL masked the problem ... wouldn't that have ended up
with densely packed pg_class?  Maybe not, if it rebuilt everything
else after pg_class...

regards, tom lane




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-23 Thread Tom Lane
Michael Paquier  writes:
> On Tue, Apr 23, 2019 at 04:47:19PM -0400, Tom Lane wrote:
>> Is there some precondition you're not mentioning?

> Hm.  In my own init scripts, I create a new database just after
> starting the instance.

Ah, there we go:

regression=# create database d1;
CREATE DATABASE
regression=# \c d1
You are now connected to database "d1" as user "postgres".
d1=# reindex index pg_class_relname_nsp_index;
psql: server closed the connection unexpectedly

log shows

TRAP: FailedAssertion("!(!ReindexIsProcessingIndex(((indexRelation)->rd_id)))", 
File: "indexam.c", Line: 204)

#2  0x008c74ed in ExceptionalCondition (
conditionName=, errorType=, 
fileName=, lineNumber=)
at assert.c:54
#3  0x004e4f8c in index_insert (indexRelation=0x7f80f849a5d8, 
values=0x7ffc4f65b030, isnull=0x7ffc4f65b130, heap_t_ctid=0x2842c0c, 
heapRelation=0x7f80f84bab68, checkUnique=UNIQUE_CHECK_YES, 
indexInfo=0x2843230) at indexam.c:204
#4  0x0054c290 in CatalogIndexInsert (indstate=, 
heapTuple=0x2842c08) at indexing.c:140
#5  0x0054c472 in CatalogTupleUpdate (heapRel=0x7f80f84bab68, 
otid=0x2842c0c, tup=0x2842c08) at indexing.c:215
#6  0x008bca77 in RelationSetNewRelfilenode (relation=0x7f80f849a5d8, 
persistence=112 'p') at relcache.c:3531
#7  0x00548b3a in reindex_index (indexId=2663, 
skip_constraint_checks=false, persistence=112 'p', options=0)
at index.c:3339
#8  0x005ed099 in ReindexIndex (indexRelation=, 
options=0, concurrent=false) at indexcmds.c:2304
#9  0x007b5925 in standard_ProcessUtility (pstmt=0x281fd70, 

> If I apply my previous patch to make CatalogIndexInsert() not do
> an insert on a catalog index being rebuilt, then things turn to be
> fine.

That seems like pretty much of a hack :-(, in particular I'm not
convinced that we'd not end up with a missing index entry afterwards.
Maybe it's the only way, but I think first we need to trace down
exactly why this broke.  I remember we had some special-case code
for reindexing pg_class, maybe something broke that?

It also seems quite odd that it doesn't fail every time; surely it's
not conditional whether we'll try to insert a new pg_class tuple or not?
We need to understand that, too.  Maybe the old code never really
worked in all cases?  It seems clear that the commit you bisected to
just allowed a pre-existing misbehavior to become exposed (more easily).

No time to look closer right now.

regards, tom lane




Re: Runtime pruning problem

2019-04-23 Thread David Rowley
On Tue, 23 Apr 2019 at 01:12, David Rowley  wrote:
> I started working on this today and I've attached what I have so far.

I've added this to the July commitfest so that I don't forget about it.

https://commitfest.postgresql.org/23/2102/

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-23 Thread Michael Paquier
On Tue, Apr 23, 2019 at 04:47:19PM -0400, Tom Lane wrote:
> regression=# reindex index pg_class_relname_nsp_index;
> REINDEX
> regression=# reindex index pg_class_oid_index;
> REINDEX
> regression=# reindex index pg_class_tblspc_relfilenode_index;
> REINDEX
> regression=# reindex table pg_class; 
> REINDEX
> regression=# reindex index pg_class_tblspc_relfilenode_index;
> REINDEX
> 
> Is there some precondition you're not mentioning?

Hm.  In my own init scripts, I create a new database just after
starting the instance.  That seems to help in reproducing the
failure, because each time I create a new database, connect to it and
reindex, I can see the crash.  If I do a reindex of pg_class
first, I don't see a crash because some rebuilds have already happened,
but if I directly reindex one of its indexes first, then the failure is
plain.  If I also add some regression tests, say in create_index.sql,
to stress a reindex of pg_class and its indexes, the crash also shows
up.  If I apply my previous patch to make CatalogIndexInsert() not do
an insert on a catalog index being rebuilt, then things turn out
fine.
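
(Roughly, the idea is something like the following inside
CatalogIndexInsert()'s loop over the catalog's indexes -- a sketch of the
idea only, not the actual patch, and the variable name for the index
relation is approximate:)

        /* Skip any catalog index that REINDEX is currently rebuilding. */
        if (ReindexIsProcessingIndex(RelationGetRelid(indexRelation)))
            continue;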
--
Michael


signature.asc
Description: PGP signature


Re: pg_dump is broken for partition tablespaces

2019-04-23 Thread David Rowley
On Wed, 24 Apr 2019 at 10:26, Alvaro Herrera  wrote:
> If creating a partition, there is the additional rule that parent's
> tablespace overrides default_tablespace:
>
>  a2. if there's a TABLESPACE clause, use that.
>  b2. otherwise, if the parent has a tablespace, use that.
>  c2. otherwise, if there's a default_tablespace, use that.
>  d2. otherwise, use the database tablespace.
>  e2. if we end up with the database tablespace, overwrite with 0.

Wouldn't it just take the proposed pg_dump change to get that?  rule
e2 says we'll store 0 in reltablespace, even if the user does
TABLESPACE pg_default, so there's no requirement to adjust the hack in
heap_create to put any additional conditions on when we set
reltablespace to 0, so it looks like none of the patching work you did
would be required to implement this. Right?

The good part about that is that it's consistent with what happens if
the user does TABLESPACE pg_default for any other object type that
supports tablespaces, i.e. we always store 0.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Pluggable Storage - Andres's take

2019-04-23 Thread Tom Lane
Andres Freund  writes:
> ... I think none of these are critical issues for tableam, but we should fix
> them.

> I'm not sure about doing so for v12 though. 1) and 3) are fairly
> trivial, but 2) would involve changing the FDW interface, by changing
> the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
> we're not even in beta1.

Probably better to fix those API issues now rather than later.

regards, tom lane




Re: Pluggable Storage - Andres's take

2019-04-23 Thread Andres Freund
Hi,

On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> index_update_stats() calls RelationGetNumberOfBlocks(). If the AM
> doesn't use normal data files, that won't work. I bumped into that with my
> toy implementation, which wouldn't need to create any data files, if it
> wasn't for this.

There are a few more of these:

1) index_update_stats(), computing pg_class.relpages

   Feels like the number of both heap and index blocks should be
   computed by the index build and stored in IndexInfo. That'd also get
   a bit closer towards allowing indexams not going through smgr (useful
   e.g. for memory only ones).

2) commands/analyze.c, computing pg_class.relpages

   This should imo be moved to the tableam callback. It's currently done
   a bit weirdly imo, with FDWs computing relpages in the callback, but
   then also returning the acquirefunc. Seems like it should entirely be
   computed as part of calling the acquirefunc.

3) nodeTidscan, skipping over too large tids
   I think this should just be moved into the AMs, there's no need to
   have this in nodeTidscan.c

4) freespace.c, used for the new small-rels-have-no-fsm paths.
   That's being revised currently anyway. But I'm not particularly
   concerned even if it stays as is - freespace use is optional
   anyway. And I can't quite see an AM that doesn't want to use
   postgres' storage mechanism wanting to use freespace.c

   Therefore I'm inclined not to touch this, independent of fixing the
   others.

I think none of these are critical issues for tableam, but we should fix
them.

I'm not sure about doing so for v12 though. 1) and 3) are fairly
trivial, but 2) would involve changing the FDW interface, by changing
the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
we're not even in beta1.

Comments?

Greetings,

Andres Freund




Re: pg_dump is broken for partition tablespaces

2019-04-23 Thread Alvaro Herrera
Thanks for taking the time to think through this.

On 2019-Apr-22, Robert Haas wrote:

> If we know what the feature is supposed to do, then it should be
> possible to look at each relevant piece of code and decides whether it
> implements the specification correctly or not.  But if we don't have a
> specification -- that is, we don't know precisely what the feature is
> supposed to do -- then we'll just end up whacking the behavior around
> trying to get some combination that makes sense, and the whole effort
> is probably doomed.

Clearly this is now the crux of the issue ... the specification was
unclear, as I cobbled it together without thinking about the overall
implications, taking pieces of advice from various posts and my own
ideas of how it should work.

I am trying to split things in a way that makes the most sense to offer
the best possible combination of functionality.  I think there are three
steps to getting this done:

1. for pg10 and pg11, make pg_dump work correctly.  I think having
pg_dump use CREATE TABLE PARTITION OF is not correct when the partition
has a mismatching column definition.  So my proposal is to back-patch
David's pg_dump patch to pg10 and pg11.  Thread:
https://postgr.es/m/20190423185007.GA27954@alvherre.pgsql
Note that this changes pg_dump behavior in back branches.

2. for pg12, try to keep as much of the tablespace inheritance
functionality as possible (more below on how this works), without
running into weird cases.  I think if we just forbid the case of
the tablespace being defined to the database tablespace, all the
weird cases disappear from the code.

3. For pg13, we can try to put back the functionality of the database
tablespace as default for a partition.  You suggest using the value 1,
and Tom suggests adding a new predefined row in pg_tablespace that has
the meaning of "whatever the default tablespace is for the current
database".  I have a slight preference towards having the additional
magical row(s).

Maybe people would like to see #3 put in pg12.  I don't oppose that, but
it seems to be new functionality development; the RMT would have to exempt
this from feature freeze.

Robert wrote:

> - I don't have any idea how to salvage things in v11, where the
> catalog representation is irredeemably ambiguous.  We'd probably have
> to be satisfied with fairly goofy behavior in v11.

This behavior is in use only for indexes in pg11, and does not allow
setting the database tablespace anyway (it just gets set back to 0 if
you do that), so I don't think there's anything *too* goofy about it.
(But whatever we do, pg11 behavior for tables would differ from pg12;
and it would differ for indexes too, depending on how we handle the
default_tablespace thing.)

> - It's not enough to have a good catalog representation.  You also
> have to be able to recreate those catalog states.  I think you
> probably need a way to set the reltablespace to whatever value you
> need it to have without changing the tables under it.  Like ALTER
> TABLE ONLY blah {TABLESPACE some_tablespace_name | NO TABLESPACE} or
> some such thing.  That exact idea may be wrong, but I think we are not
> going to have much luck getting to a happy place without something of
> this sort.

David proposed ALTER TABLE .. TABLESPACE DEFAULT as a way to put back
the database tablespace behavior (and that's implemented in my previous
patch), which seems good to me.  I'm not sure I like "NO TABLESPACE"
better than DEFAULT, but if others do, I'm not really wedded to it.


Now I discuss how tablespace inheritance works.  To define the
tablespace of a regular table, there's a simple algorithm:

 a1. if there's a TABLESPACE clause, use that.
 b1. otherwise, if there's a default_tablespace, use that.
 c1. otherwise, use the database tablespace.
 d1. if we end up with the database tablespace, overwrite with 0.

If creating a partition, there is the additional rule that parent's
tablespace overrides default_tablespace:

 a2. if there's a TABLESPACE clause, use that.
 b2. otherwise, if the parent has a tablespace, use that.
 c2. otherwise, if there's a default_tablespace, use that.
 d2. otherwise, use the database tablespace.
 e2. if we end up with the database tablespace, overwrite with 0.

Finally, there's the question of what defines the parent's tablespace.
I think the best algorithm to determine the tablespace when creating a
partitioned relation is the same as for partitions:

 a3. if there's a TABLESPACE clause, use that.
 b3. otherwise, if the parent has a tablespace, use that.  (We have this
  because partitions might be partitioned.)
 c3. otherwise, if there's default_tablespace, use that.
 d3. otherwise, use the database tablespace.
 e3. if we end up with the database tablespace, overwrite with 0.

Tracing through the sets of rules, they are all alike -- the only
difference is that regular tables lack the rule about parent's
tablespace.
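
For illustration, the partition rules a2 through e2 could be sketched roughly
like this (hypothetical helper, not code from any patch; InvalidOid stands
for "no explicit setting" here):

#include "postgres.h"

static Oid
resolve_partition_tablespace(Oid explicit_tblspc, Oid parent_tblspc,
                             Oid default_tblspc, Oid db_tblspc)
{
    Oid         result;

    if (OidIsValid(explicit_tblspc))        /* a2: TABLESPACE clause */
        result = explicit_tblspc;
    else if (OidIsValid(parent_tblspc))     /* b2: parent's tablespace */
        result = parent_tblspc;
    else if (OidIsValid(default_tblspc))    /* c2: default_tablespace */
        result = default_tblspc;
    else
        result = db_tblspc;                 /* d2: database tablespace */

    /* e2: the database tablespace is stored as 0 in reltablespace */
    if (result == db_tblspc)
        result = InvalidOid;

    return result;
}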

Some notes on this:

1. We have rule c3, that is, we still honor 

Re: Why does ExecComputeStoredGenerated() form a heap tuple

2019-04-23 Thread David Rowley
On Tue, 23 Apr 2019 at 20:23, Peter Eisentraut
 wrote:
> The attached patch is based on your sketch.  It's clearly better in the
> long term not to rely on heap tuples here.  But in testing this change
> seems to make it slightly slower, certainly not a speedup as you were
> apparently hoping for.
>
>
> Test setup:
>
> create table t0 (a int, b int);
> insert into t0 select generate_series(1, 10000000);  -- 10 million
> \copy t0 (a) to 'test.dat';
>
> -- for comparison, without generated column
> truncate t0;
> \copy t0 (a) from 'test.dat';
>
> -- master
> create table t1 (a int, b int generated always as (a * 2) stored);
> truncate t1;
> \copy t1 (a) from 'test.dat';
>
> -- patched
> truncate t1;
> \copy t1 (a) from 'test.dat';

I didn't do the exact same test, but if I use COPY instead of \copy,
then for me patched is faster.

Normal table:


postgres=# copy t0 (a) from '/home/drowley/test.dat';
COPY 10000000
Time: 5437.768 ms (00:05.438)
postgres=# truncate t0;
TRUNCATE TABLE
Time: 20.775 ms
postgres=# copy t0 (a) from '/home/drowley/test.dat';
COPY 10000000
Time: 5272.228 ms (00:05.272)

Master:

postgres=# copy t1 (a) from '/home/drowley/test.dat';
COPY 10000000
Time: 6570.031 ms (00:06.570)
postgres=# truncate t1;
TRUNCATE TABLE
Time: 17.813 ms
postgres=# copy t1 (a) from '/home/drowley/test.dat';
COPY 10000000
Time: 6486.253 ms (00:06.486)

Patched:

postgres=# copy t1 (a) from '/home/drowley/test.dat';
COPY 10000000
Time: 5359.338 ms (00:05.359)
postgres=# truncate table t1;
TRUNCATE TABLE
Time: 25.551 ms
postgres=# copy t1 (a) from '/home/drowley/test.dat';
COPY 10000000
Time: 5347.596 ms (00:05.348)


For the patch, I wonder if you need this line:

+ memcpy(values, slot->tts_values, sizeof(*values) * natts);

If you got rid of that and changed the datumCopy to use
slot->tts_values[i] instead.

Maybe it's also worth getting rid of the first memcpy for the null
array and just assign the element in the else clause.

It might also be cleaner to assign TupleDescAttr(tupdesc, i) to a
variable instead of using the macro 3 times. It'd make that datumCopy
line shorter too.
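
Something like this is what I have in mind (just an untested sketch against
your patch; I'm assuming natts, tupdesc, values, nulls and the slot are in
scope as they are now, and that val/isnull hold the already-evaluated
generation expression):

    for (int i = 0; i < natts; i++)
    {
        Form_pg_attribute attr = TupleDescAttr(tupdesc, i);

        if (attr->attgenerated == ATTRIBUTE_GENERATED_STORED)
        {
            /* val/isnull come from evaluating the generation expression */
            if (isnull)
                values[i] = (Datum) 0;
            else
                values[i] = datumCopy(val, attr->attbyval, attr->attlen);
            nulls[i] = isnull;
        }
        else
        {
            /* no up-front memcpy; take the value straight from the slot */
            values[i] = slot->tts_values[i];
            nulls[i] = slot->tts_isnull[i];
        }
    }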

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: TM format can mix encodings in to_char()

2019-04-23 Thread Tom Lane
Andrew Dunstan  writes:
> Test above works as expected with the patch, see attached.  This is from
> jacana.

Great, thanks for checking!

> LMK if you want more tests run before I blow the test instance away

Can't think of anything else.

It'd be nice if we could cover stuff like this in the regression tests,
but I'm not sure how, seeing that the locale names are platform-dependent
and the overall behavior will also depend on the database encoding ...

regards, tom lane




Re: TM format can mix encodings in to_char()

2019-04-23 Thread Andrew Dunstan

On 4/21/19 10:21 AM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> On 4/21/19 12:28 AM, Tom Lane wrote:
>>> I don't have any way to test this on Windows, so could somebody
>>> do that?  Manually running the Turkish test cases ought to be enough.
>> How does one do that? Just set a Turkish locale?
> Try variants of the original test case.  For instance, in a UTF8
> database,
>
> regression=# show server_encoding ;
>  server_encoding 
> -
>  UTF8
> (1 row)
>
> regression=# SET lc_time TO 'tr_TR.iso88599';
> SET
> regression=# SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');
>to_char
> --
>  01 ŞUB 2010
> (1 row)
>
> Unpatched, I get an error about invalid data.  Now, this is in
> a Linux machine, and you'll have to adapt it for Windows --- at
> least change the LC_TIME setting.  But the idea is to get out some
> non-ASCII strings from an LC_TIME setting that names an encoding
> different from the database's.
>
> (I suspect you'll find that the existing code works fine on
> Windows, it's only the first version(s) of this patch that fail.)
>
>   



Test above works as expected with the patch, see attached.  This is from
jacana.


LMK if you want more tests run before I blow the test instance away


cheers


andrew



-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

select getdatabaseencoding();
 getdatabaseencoding 
-
 UTF8
(1 row)

set lc_time to 'Turkish_Turkey.1254';
SET
select to_char(date '2010-02-01','DD TMMON YYYY');
   to_char   
-
 01 ŞUB 2010
(1 row)



Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

2019-04-23 Thread Tom Lane
ISTM we could work around the problem with the attached, which I think
is a good change independently of anything else.

There is still an issue, which manifests in both 11 and HEAD, namely
that the code also propagates the parent index's comment to any child
indexes.  You can see that with this extended test case:

create table users(user_id int, name varchar(64), unique (user_id, name)) 
partition by hash(user_id);
comment on index users_user_id_name_key is 'parent index';
create table users_000 partition of users for values with (modulus 2, remainder 
0);   
create table users_001 partition of users for values with (modulus 2, remainder 
1);   

select relname, relfilenode, obj_description(oid,'pg_class') from pg_class 
where relname like 'users%';
alter table users alter column name type varchar(127); 
select relname, relfilenode, obj_description(oid,'pg_class') from pg_class 
where relname like 'users%';

which gives me (in 11, with this patch)

...
  relname   | relfilenode | obj_description 
+-+-
 users  |   89389 | 
 users_000  |   89394 | 
 users_000_user_id_name_key |   89397 | 
 users_001  |   89399 | 
 users_001_user_id_name_key |   89402 | 
 users_user_id_name_key |   89392 | parent index
(6 rows)

ALTER TABLE
  relname   | relfilenode | obj_description 
+-+-
 users  |   89389 | 
 users_000  |   89394 | 
 users_000_user_id_name_key |   89406 | parent index
 users_001  |   89399 | 
 users_001_user_id_name_key |   89408 | parent index
 users_user_id_name_key |   89404 | parent index
(6 rows)

However, I doubt that that's bad enough to justify a major rewrite
of the ALTER TABLE code in 11 ... and maybe not in HEAD either;
I wouldn't be too unhappy to leave it to v13.

regards, tom lane

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f8ee4b0..c5c543f 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -252,8 +252,17 @@ CheckIndexCompatible(Oid oldId,
 	if (!ret)
 		return false;
 
-	/* For polymorphic opcintype, column type changes break compatibility. */
+	/* We now need to actually look at the index rel. */
 	irel = index_open(oldId, AccessShareLock);	/* caller probably has a lock */
+
+	/* If it's a partitioned index, there is no storage to share. */
+	if (irel->rd_rel->relkind == RELKIND_PARTITIONED_INDEX)
+	{
+		index_close(irel, NoLock);
+		return false;
+	}
+
+	/* For polymorphic opcintype, column type changes break compatibility. */
 	for (i = 0; i < old_natts; i++)
 	{
 		if (IsPolymorphicType(get_opclass_input_type(classObjectId[i])) &&


Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-23 Thread Alvaro Herrera
On 2019-Apr-18, Michael Paquier wrote:

> Fujii-san has sent me a report offline about REINDEX.  And since 9.6,
> trying to REINDEX directly an index of pg_class fails lamentably on an
> assertion failure (mbsync failure if bypassing the assert):

Hmm, yeah, I ran into this crash too, more than a year ago, but I don't
recall what came out of the investigation, and my search-fu is failing
me.  I'll have a look at my previous laptop's drive ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-23 Thread Tom Lane
Michael Paquier  writes:
> Fujii-san has sent me a report offline about REINDEX.  And since 9.6,
> trying to REINDEX directly an index of pg_class fails lamentably on an
> assertion failure (mbsync failure if bypassing the assert):

So ... I can't reproduce this on HEAD.  Nor the back branches.

regression=# \d pg_class
...
Indexes:
"pg_class_oid_index" UNIQUE, btree (oid)
"pg_class_relname_nsp_index" UNIQUE, btree (relname, relnamespace)
"pg_class_tblspc_relfilenode_index" btree (reltablespace, relfilenode)

regression=# reindex index pg_class_relname_nsp_index;
REINDEX
regression=# reindex index pg_class_oid_index;
REINDEX
regression=# reindex index pg_class_tblspc_relfilenode_index;
REINDEX
regression=# reindex table pg_class; 
REINDEX
regression=# reindex index pg_class_tblspc_relfilenode_index;
REINDEX

Is there some precondition you're not mentioning?

regards, tom lane




Re: How and at what stage to stop FDW to generate plan with JOIN.

2019-04-23 Thread Ibrar Ahmed
On Wed, Apr 24, 2019 at 1:15 AM Tom Lane  wrote:

> Ibrar Ahmed  writes:
> > I am working on an FDW where the database does not support any operator
> > other than "=" in JOIN conditions. Some queries are generating plans with
> > a JOIN having a "<" operator. How and at what stage can I stop the FDW
> > from making such a plan? Here is my sample query.
>
> What exactly do you think should happen instead?  You can't just tell
> users not to ask such a query.  (Well, you can try, but they'll probably
> go looking for a less broken FDW.)
>
I know that.


> If what you really mean is you don't want to generate pushed-down
> foreign join paths containing non-equality conditions, the answer is
> to just not do that.  That'd be the FDW's own fault, not that of
> the core planner, if it creates a path representing a join it
> can't actually implement.  You'll end up running the join locally,
> which might not be great, but if you have no other alternative
> then that's what you gotta do.
>
Yes, that's what I am thinking. In the case of a non-equality condition,
joining them locally is the only solution. I was just confirming.


> If what you mean is you don't know how to inspect the join quals
> to see if you can implement them, take a look at postgres_fdw
> to see how it handles the same issue.
>
I don't really know whether postgres_fdw has the same issue, but yes,
postgres_fdw is always my starting point.


> regards, tom lane
>


-- 
Ibrar Ahmed


Re: How and at what stage to stop FDW to generate plan with JOIN.

2019-04-23 Thread Tom Lane
Ibrar Ahmed  writes:
> I am working on an FDW where the database does not support any operator
> other than "=" in JOIN conditions. Some queries are generating plans with
> a JOIN having a "<" operator. How and at what stage can I stop the FDW
> from making such a plan? Here is my sample query.

What exactly do you think should happen instead?  You can't just tell
users not to ask such a query.  (Well, you can try, but they'll probably
go looking for a less broken FDW.)

If what you really mean is you don't want to generate pushed-down
foreign join paths containing non-equality conditions, the answer is
to just not do that.  That'd be the FDW's own fault, not that of
the core planner, if it creates a path representing a join it
can't actually implement.  You'll end up running the join locally,
which might not be great, but if you have no other alternative
then that's what you gotta do.

If what you mean is you don't know how to inspect the join quals
to see if you can implement them, take a look at postgres_fdw
to see how it handles the same issue.
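
For instance, something along these lines would do (a rough sketch, not
lifted from postgres_fdw, and include paths may vary by branch).  If it
returns false, simply don't build a pushed-down join path, and the join
runs locally:

#include "postgres.h"
#include "nodes/pg_list.h"
#include "nodes/primnodes.h"
#include "nodes/pathnodes.h"
#include "utils/lsyscache.h"

/* Return true only if every join clause is an OpExpr whose operator is "=". */
static bool
joinclauses_are_equality_only(List *joinclauses)
{
    ListCell   *lc;

    foreach(lc, joinclauses)
    {
        RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);
        OpExpr       *opexpr;
        char         *opname;

        if (!IsA(rinfo->clause, OpExpr))
            return false;
        opexpr = (OpExpr *) rinfo->clause;

        opname = get_opname(opexpr->opno);
        if (opname == NULL || strcmp(opname, "=") != 0)
            return false;       /* "<", ">", etc.: can't push this down */
    }
    return true;
}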

regards, tom lane




Re: Trouble with FETCH_COUNT and combined queries in psql

2019-04-23 Thread Fabien COELHO




Keep in mind that a large part of the reason why the \cset patch got
bounced was exactly that its detection of \; was impossibly ugly
and broken.  Don't expect another patch using the same logic to
get looked on more favorably.


Looking at the end of the discussion about \cset, it seems what
you were against was not so much how the detection was done as
how and why it was used thereafter.

In the case of the present bug, we just need to know whether there
are any \; query separators in the command string.
If yes, then SendQuery() doesn't get to use the cursor technique to
avoid any risk with that command string, despite FETCH_COUNT>0.

PFA a simple POC patch implementing this.


Indeed it does not look that bad.

Note a side effect of simply counting: "SELECT 1 \; ;" is detected as
compound, but an internal (arguably misplaced) optimization removes the empty
query so that only one result comes back, so the cursor trick would still work.


In some earlier version, not sure whether I sent it, I tried to keep their
positions with an int array and to detect empty queries, which was a lot of
(ugly) effort.
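
For reference, the naive version of the check is just this (illustration
only; a real implementation must go through psql's lexer so that \; inside
quoted strings, dollar quotes or comments does not count):

#include <stdbool.h>

static bool
buffer_contains_backslash_semicolon(const char *query)
{
    const char *p;

    for (p = query; *p != '\0'; p++)
    {
        if (p[0] == '\\' && p[1] == ';')
            return true;
    }
    return false;
}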


--
Fabien.




Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-23 Thread Tom Lane
I wrote:
> Peter Geoghegan  writes:
>> That doesn't seem like a particularly good or complete answer, though.
>> Perhaps it should simply be called within BufFileCreateTemp(). The
>> BufFile/fd.c layering is confusing in a number of ways IMV.

> I don't actually see why BufFileCreateTemp should do it; if
> we're to add a call, seems like OpenTemporaryFile is the place,
> as that's what is really concerned with the temp tablespace(s).
> I'm in favor of doing this, I think, as it sure looks to me like
> gistInitBuildBuffers() is calling BufFileCreateTemp without any
> closely preceding PrepareTempTablespaces.  So we already have an
> instance of Melanie's bug in core.  It'd be difficult to notice
> because of the silent-fallback-to-default-tablespace behavior.

Here's a draft patch for that.

It's slightly ugly that this adds a dependency on commands/tablespace
to fd.c, which is a pretty low-level module.  I think wanting to avoid
that layering violation might've been the reason for doing things the
way they are.  However, this gets rid of tablespace dependencies in
some other files that are only marginally higher-level, like
tuplesort.c, so I'm not sure how strong that objection is.

There are three functions in fd.c that have a dependency on the
temp tablespace info having been set up:
OpenTemporaryFile
GetTempTablespaces
GetNextTempTableSpace
This patch makes the first of those automatically set up the info
if it's not done yet.  The second one has always had an assertion
that the caller did it already, and now the third one does too.
An about equally plausible change would be to make all three
call PrepareTempTablespaces, but there are so few callers of the
second and third that I'm not sure that'd be better.  Thoughts?

regards, tom lane

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 64eec91..db436e0 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -29,7 +29,6 @@
 #include "access/htup_details.h"
 #include "access/parallel.h"
 #include "catalog/pg_statistic.h"
-#include "commands/tablespace.h"
 #include "executor/execdebug.h"
 #include "executor/hashjoin.h"
 #include "executor/nodeHash.h"
@@ -570,9 +569,7 @@ ExecHashTableCreate(HashState *state, List *hashOperators, List *hashCollations,
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
-		/* The files will not be opened until needed... */
-		/* ... but make sure we have temp tablespaces established for them */
-		PrepareTempTablespaces();
+		/* The files will not be opened until needed */
 	}
 
 	MemoryContextSwitchTo(oldcxt);
@@ -917,8 +914,6 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 			palloc0(nbatch * sizeof(BufFile *));
 		hashtable->outerBatchFile = (BufFile **)
 			palloc0(nbatch * sizeof(BufFile *));
-		/* time to establish the temp tablespaces, too */
-		PrepareTempTablespaces();
 	}
 	else
 	{
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index fdac985..da89f2b 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -83,6 +83,7 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/pg_tablespace.h"
+#include "commands/tablespace.h"
 #include "common/file_perm.h"
 #include "pgstat.h"
 #include "portability/mem.h"
@@ -1466,6 +1467,14 @@ OpenTemporaryFile(bool interXact)
 	File		file = 0;
 
 	/*
+	 * If we want to use a temp tablespace, make sure that info has been set
+	 * up in the current transaction.  (This is harmless if we're not in a
+	 * transaction, and it's also cheap if the info is already set up.)
+	 */
+	if (!interXact)
+		PrepareTempTablespaces();
+
+	/*
 	 * Make sure the current resource owner has space for this File before we
 	 * open it, if we'll be registering it below.
 	 */
@@ -2732,6 +2741,7 @@ GetTempTablespaces(Oid *tableSpaces, int numSpaces)
 Oid
 GetNextTempTableSpace(void)
 {
+	Assert(TempTablespacesAreSet());
 	if (numTempTableSpaces > 0)
 	{
 		/* Advance nextTempTableSpace counter with wraparound */
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3eebd9e..bd9f215 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -101,7 +101,6 @@
 #include "access/nbtree.h"
 #include "catalog/index.h"
 #include "catalog/pg_am.h"
-#include "commands/tablespace.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
@@ -2464,13 +2463,6 @@ inittapestate(Tuplesortstate *state, int maxTapes)
 	if (tapeSpace + GetMemoryChunkSpace(state->memtuples) < state->allowedMem)
 		USEMEM(state, tapeSpace);
 
-	/*
-	 * Make sure that the temp file(s) underlying the tape set are created in
-	 * suitable temp tablespaces.  For parallel sorts, this should have been
-	 * called already, but it doesn't matter if it is called a second time.
-	 */
-	

How and at what stage to stop an FDW from generating a plan with a JOIN.

2019-04-23 Thread Ibrar Ahmed
Hi,

I am working on an FDW where the remote database does not support any
operator other than "=" in a JOIN condition. Some queries are generating a
plan with a JOIN that uses a "<" operator. How, and at what stage, can I
stop the FDW from producing such a plan? Here is my sample query.



tpch=# select
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    o_orderdate,
    o_shippriority
from
    customer,
    orders,
    lineitem
where
    c_mktsegment = 'BUILDING'
    and c_custkey = o_custkey
    and l_orderkey = o_orderkey
    and o_orderdate < date '1995-03-22'
    and l_shipdate > date '1995-03-22'
group by
    l_orderkey,
    o_orderdate,
    o_shippriority
order by
    revenue,
    o_orderdate
LIMIT 10;



   QUERY PLAN
...
   Merge Cond: (orders.o_orderkey = lineitem.l_orderkey)
   ->  Foreign Scan  (cost=1.00..-1.00 rows=1000 width=50)
         Output: orders.o_orderdate, orders.o_shippriority, orders.o_orderkey
         Relations: (customer) INNER JOIN (orders)
         Remote SQL: SELECT r2.o_orderdate, r2.o_shippriority, r2.o_orderkey
           FROM  db.customer
           r1 ALL INNER JOIN db.orders r2 ON (((r1.c_custkey = r2.o_custkey)) AND
           ((r2.o_orderdate < '1995-03-22')) AND ((r1.c_mktsegment = 'BUILDING')))
           ORDER BY r2.o_orderkey, r2.o_orderdate, r2.o_shippriority
...
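
For illustration only, here is a rough, untested sketch of the usual place to
make that decision: an FDW's GetForeignJoinPaths callback can inspect the join
clauses in extra->restrictlist and simply decline to add a ForeignPath when any
clause uses an operator other than "=", in which case the join is evaluated
locally. The my_fdw_* name and the omitted path construction are placeholders,
not any existing FDW's code, and WHERE conditions pushed down into the remote
join would need a similar check of their own.

#include "postgres.h"
#include "foreign/fdwapi.h"
#include "nodes/pathnodes.h"
#include "utils/lsyscache.h"

static void
my_fdw_GetForeignJoinPaths(PlannerInfo *root, RelOptInfo *joinrel,
                           RelOptInfo *outerrel, RelOptInfo *innerrel,
                           JoinType jointype, JoinPathExtraData *extra)
{
    ListCell   *lc;

    foreach(lc, extra->restrictlist)
    {
        RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc);

        if (IsA(rinfo->clause, OpExpr))
        {
            char       *opname = get_opname(((OpExpr *) rinfo->clause)->opno);

            if (opname == NULL || strcmp(opname, "=") != 0)
                return;         /* do not add a foreign path; join locally */
        }
        else
            return;             /* unfamiliar clause shape: play it safe */
    }

    /* every join qual is a plain "=": build and add the ForeignPath here */
}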


--

Ibrar Ahmed


Re: block-level incremental backup

2019-04-23 Thread Adam Brusselback
I hope it's alright to throw in my $0.02 as a user. I've been following
this (and the other thread on reading WAL to find modified blocks,
prefaulting, whatever else) since the start with great excitement and would
love to see the built-in backup capabilities in Postgres greatly improved.
I know this is not completely on-topic for just incremental backups, so I
apologize in advance. It just seemed like the most apt place to chime in.


Just to preface where I am coming from: I have been using pgBackRest for
the past couple of years and used wal-e prior to that.  I am not a big *nix
user other than on my servers; I do all my development on Windows and
primarily use Java. The command line is not where I feel most comfortable,
despite my best efforts over the last 5-6 years. Prior to Postgres, I used
SQL Server for quite a few years at previous companies, but had a more
junior / intermediate skill set back then. I just wanted to put that out
there so you can see where my biases are.




With all that said, I would not be comfortable using pg_basebackup as my
main backup tool simply because I’d have to cobble together numerous tools
to get backups stored in a safe (not on the same server) location, I’d have
to manage expiring backups and the WAL which is no longer needed, along
with the rest of the stuff that makes these backup management tools useful.


The command line scares me, and even if I were able to get all that working,
I would not feel warm and fuzzy that I hadn’t messed something up horribly,
and I might hit an edge case which destroys backups, silently corrupts data,
etc.

I love that there are tools that manage all of it; backups, wal archiving,
remote storage, integrate with cloud storage (S3 and the like), manages the
retention of these backups with all their dependencies for me, and has all
the restore options necessary built in as well.


Block level incremental backup would be amazing for my use case. I have
small updates / deletes that happen to data all over some of my largest
tables. With pgBackRest, since the diff/incremental backups are at the file
level, I can have a single update / delete which touched a random spot in a
table and now requires that whole 1gb file to be backed up again. That
said, even if pg_basebackup was the only tool that did incremental block
level backup tomorrow, I still wouldn’t start using it directly. I went
into the issues I’d have to deal with if I used pg_basebackup above, and
incremental backups without a management tool make me think using it
correctly would be much harder.


I know this thread is just about incremental backup, and that pretty much
everything in core is built up from small features into larger more complex
ones. I understand that and am not trying to dump on any efforts, I am
super excited to see work being done in this area! I just wanted to share
my perspective on how crucial good backup management is to me (and I’m sure
a few others may share my sentiment considering how popular all the
external tools are).

I would never put a system in production unless I have some backup
management in place. If core builds a backup management tool which uses
pg_basebackup as building blocks for its solution…awesome! That may be
something I’d use.  If pg_basebackup can be improved so it can be used as
the basis most external backup management tools can build on top of, that’s
also great. All the external tools which practically every Postgres company
have built show that it’s obviously a need for a lot of users. Core will
never solve every single problem for all users, I know that. It would just
be great to see some of the fundamental features of backup management baked
into core in an extensible way.

With that, there could be a recommended way to set up backups
(full/incremental, parallel, compressed), point in time recovery, backup
retention, and perform restores (to a point in time, on a replica server,
etc) with just the tooling within core with a nice and simple user
interface, and great performance.

If those features core supports in the internal tooling are built in an
extensible way (as has been discussed), there could be much less
duplication of work implementing the same base features over and over for
each external tool. Those companies can focus on more value-added features
to their own products that core would never support, or on improving the
tooling/performance/features core provides.


Well, this is way longer and a lot less coherent than I was hoping, so I
apologize for that. Hopefully my stream of thoughts made a little bit of
sense to someone.


-Adam


Minor postmaster state machine bugs

2019-04-23 Thread Tom Lane
In pursuit of the problem with standby servers sometimes not responding
to fast shutdowns [1], I spent awhile staring at the postmaster's
state-machine logic.  I have not found a cause for that problem,
but I have identified some other things that seem like bugs:


1. sigusr1_handler ignores PMSIGNAL_ADVANCE_STATE_MACHINE unless the
current state is PM_WAIT_BACKUP or PM_WAIT_BACKENDS.  This restriction
seems useless and shortsighted: PostmasterStateMachine should behave
sanely regardless of our state, and sigusr1_handler really has no
business assuming anything about why a child is asking for a state
machine reconsideration.  But it's not just not future-proof, it's a
live bug even for the one existing use-case, which is that a new
walsender sends this signal after it's re-marked itself as being a
walsender rather than a normal backend.  Consider this sequence of
events:
* System is running as a hot standby and allowing cascaded replication.
There are no live backends.
* New replication connection request is received and forked off.
(At this point the postmaster thinks this child is a normal session
backend.)
* SIGTERM (Smart Shutdown) is received.  Postmaster will transition
to PM_WAIT_READONLY.  I don't think it would have autovac or bgworker
or bgwriter or walwriter children, but if so, assume they all exit
before the next step.  Postmaster will continue to sleep, waiting for
its one "normal" child backend to finish.
* Replication connection request completes, so child re-marks itself
as a walsender and sends PMSIGNAL_ADVANCE_STATE_MACHINE.
* Postmaster ignores signal because it's in the "wrong" state, so it
doesn't realize it now has no normal backend children.
* Postmaster waits forever, or at least till DBA loses patience and
sends a stronger signal.

This scenario doesn't explain the buildfarm failures since those don't
involve smart shutdowns (and I think they don't involve cascaded
replication either).  Still, it's clearly a bug, which I think
we should fix by removing the pointless restriction on whether
PostmasterStateMachine can be called.

Also, I'm inclined to think that that should be the *last* step in
sigusr1_handler, not randomly somewhere in the middle.  As coded,
it's basically assuming that no later action in sigusr1_handler
could affect anything that PostmasterStateMachine cares about, which
even if it's true today is another highly not-future-proof assumption.
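
Concretely, the proposal amounts to something like this sketch (assuming the
current coding in sigusr1_handler is roughly as shown in the first hunk):

    /* today, roughly: */
    if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE) &&
        (pmState == PM_WAIT_BACKUP || pmState == PM_WAIT_BACKENDS))
    {
        /* Advance postmaster's state machine */
        PostmasterStateMachine();
    }

    /* proposed: note the request up front, act on it as the handler's
     * last step, with no restriction on the current pmState */
    bool    advance_state_machine =
        CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE);

    /* ... rest of sigusr1_handler's signal processing ... */

    if (advance_state_machine)
        PostmasterStateMachine();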


2. MaybeStartWalReceiver will clear the WalReceiverRequested flag
even if it fails to launch a child process for some reason.  This
is just dumb; it should leave the flag set so that we'll try again
next time through the postmaster's idle loop.
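
In code form the fix is about this much (sketch only):

    /* in MaybeStartWalReceiver(), clear the request flag only on success */
    WalReceiverPID = StartWalReceiver();
    if (WalReceiverPID != 0)
        WalReceiverRequested = false;
    /* else leave WalReceiverRequested set, so the idle loop retries */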


3. PostmasterStateMachine's handling of PM_SHUTDOWN_2 is:

if (pmState == PM_SHUTDOWN_2)
{
/*
 * PM_SHUTDOWN_2 state ends when there's no other children than
 * dead_end children left. There shouldn't be any regular backends
 * left by now anyway; what we're really waiting for is walsenders and
 * archiver.
 *
 * Walreceiver should normally be dead by now, but not when a fast
 * shutdown is performed during recovery.
 */
if (PgArchPID == 0 && CountChildren(BACKEND_TYPE_ALL) == 0 &&
WalReceiverPID == 0)
{
pmState = PM_WAIT_DEAD_END;
}
}

The comment about walreceivers is confusing, and it's also wrong.  Maybe
it was valid when written, but today it's easy to trace the logic and see
that we can only get to PM_SHUTDOWN_2 state from PM_SHUTDOWN state, and
we can only get to PM_SHUTDOWN state when there is no live walreceiver
(cf processing of PM_WAIT_BACKENDS state), and we won't attempt to launch
a new walreceiver while in PM_SHUTDOWN or PM_SHUTDOWN_2 state, so it's
impossible for there to be any walreceiver here.  I think we should just
remove that comment and the WalReceiverPID == 0 test.
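
That is, the block would reduce to something like:

    if (pmState == PM_SHUTDOWN_2)
    {
        /*
         * PM_SHUTDOWN_2 state ends when there's no other children than
         * dead_end children left.  There shouldn't be any regular backends
         * left by now anyway; what we're really waiting for is walsenders
         * and archiver.
         */
        if (PgArchPID == 0 && CountChildren(BACKEND_TYPE_ALL) == 0)
            pmState = PM_WAIT_DEAD_END;
    }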


Comments?  I think at least the first two points need to be back-patched.

regards, tom lane

[1] https://www.postgresql.org/message-id/20190416070119.gk2...@paquier.xyz




Re: Trouble with FETCH_COUNT and combined queries in psql

2019-04-23 Thread Daniel Verite
Tom Lane wrote:

> Keep in mind that a large part of the reason why the \cset patch got
> bounced was exactly that its detection of \; was impossibly ugly
> and broken.  Don't expect another patch using the same logic to
> get looked on more favorably.

Looking at the end of the discussion about \cset, it seems that what
you were against was not so much how the detection was done as how
and why it was used thereafter.

In the case of the present bug, we just need to know whether there
are any \; query separators in the command string.
If there are, then SendQuery() skips the cursor technique for that
command string, despite FETCH_COUNT > 0, to avoid any risk.

PFA a simple POC patch implementing this.

> Having said that, I did like the idea of maybe going over to
> PQsetSingleRowMode instead of using an explicit cursor.  That
> would represent a net decrease of cruftiness here, instead of
> layering more cruft on top of what's already a mighty ugly hack.

It would also work with queries that start with a CTE, and with queries
like UPDATE/INSERT/DELETE .. RETURNING, which the current cursor-based
approach cannot handle. But that looks like a project for PG13,
whereas a fix like the attached could be back-patched.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index bd28444..78641e0 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -1405,7 +1405,8 @@ SendQuery(const char *query)
 		results = NULL;			/* PQclear(NULL) does nothing */
 	}
 	else if (pset.fetch_count <= 0 || pset.gexec_flag ||
-			 pset.crosstab_flag || !is_select_command(query))
+			 pset.crosstab_flag || pset.num_semicolons > 0 ||
+			 !is_select_command(query))
 	{
 		/* Default fetch-it-all-and-print mode */
 		instr_time	before,
diff --git a/src/bin/psql/mainloop.c b/src/bin/psql/mainloop.c
index 3ae4470..e5af7a2 100644
--- a/src/bin/psql/mainloop.c
+++ b/src/bin/psql/mainloop.c
@@ -426,6 +426,7 @@ MainLoop(FILE *source)
 /* execute query unless we're in an inactive \if branch */
 if (conditional_active(cond_stack))
 {
+	pset.num_semicolons = psql_scan_get_escaped_semicolons(scan_state);
 	success = SendQuery(query_buf->data);
 	slashCmdStatus = success ? PSQL_CMD_SEND : PSQL_CMD_ERROR;
 	pset.stmt_lineno = 1;
@@ -502,6 +503,7 @@ MainLoop(FILE *source)
 	/* should not see this in inactive branch */
 	Assert(conditional_active(cond_stack));
 
+	pset.num_semicolons = psql_scan_get_escaped_semicolons(scan_state);
 	success = SendQuery(query_buf->data);
 
 	/* transfer query to previous_buf by pointer-swapping */
diff --git a/src/bin/psql/settings.h b/src/bin/psql/settings.h
index 5be5091..2755927 100644
--- a/src/bin/psql/settings.h
+++ b/src/bin/psql/settings.h
@@ -110,6 +110,8 @@ typedef struct _psqlSettings
 	char	   *inputfile;		/* file being currently processed, if any */
 	uint64		lineno;			/* also for error reporting */
 	uint64		stmt_lineno;	/* line number inside the current statement */
+	int			num_semicolons;	/* number of query separators (\;) found
+   inside the current statement */
 
 	bool		timing;			/* enable timing of all queries */
 
diff --git a/src/fe_utils/psqlscan.l b/src/fe_utils/psqlscan.l
index 850754e..ee32547 100644
--- a/src/fe_utils/psqlscan.l
+++ b/src/fe_utils/psqlscan.l
@@ -694,8 +694,15 @@ other			.
 	 * substitution.  We want these before {self}, also.
 	 */
 
-"\\"[;:]		{
-	/* Force a semi-colon or colon into the query buffer */
+"\\";			{
+	/* Count semicolons in compound commands */
+	cur_state->escaped_semicolons++;
+	/* Force a semicolon into the query buffer */
+	psqlscan_emit(cur_state, yytext + 1, 1);
+}
+
+"\\":			{
+	/* Force a colon into the query buffer */
 	psqlscan_emit(cur_state, yytext + 1, 1);
 }
 
@@ -1066,6 +1073,9 @@ psql_scan(PsqlScanState state,
 	/* Set current output target */
 	state->output_buf = query_buf;
 
+	/* Reset number of escaped semicolons seen */
+	state->escaped_semicolons = 0;
+
 	/* Set input source */
 	if (state->buffer_stack != NULL)
 		yy_switch_to_buffer(state->buffer_stack->buf, state->scanner);
@@ -1210,6 +1220,16 @@ psql_scan_reset(PsqlScanState state)
 }
 
 /*
+ * Return the number of escaped semicolons in the lexed string seen by the
+ * previous psql_scan call.
+ */
+int
+psql_scan_get_escaped_semicolons(PsqlScanState state)
+{
+	return state->escaped_semicolons;
+}
+
+/*
  * Reselect this lexer (psqlscan.l) after using another one.
  *
  * Currently and for foreseeable uses, it's sufficient to reset to INITIAL
diff --git a/src/include/fe_utils/psqlscan.h b/src/include/fe_utils/psqlscan.h
index c3a1d58..395ac78 100644
--- a/src/include/fe_utils/psqlscan.h
+++ b/src/include/fe_utils/psqlscan.h
@@ -83,6 +83,8 @@ extern PsqlScanResult psql_scan(PsqlScanState state,
 
 extern void psql_scan_reset(PsqlScanState state);
 
+extern 

pg_dump partitions can lead to inconsistent state after restore

2019-04-23 Thread Alvaro Herrera
Per my comment at https://postgr.es/m/2019045129.GA6126@alvherre.pgsql
I think that pg_dump can possibly cause bogus partition definitions
when users explicitly decide to attach tables as partitions that have
different column ordering than the parent table.  Any COPY or INSERT
command without an explicit column list that tries to put tuples in the
table will fail after the restore.

Tom Lane said:

> I haven't looked at the partitioning code, but I am quite sure that that's
> always happened for old-style inheritance children, and I imagine pg_dump
> is just duplicating that old behavior.

Actually, the new code is unrelated to the old one; for legacy
inheritance, the children are always created exactly as they were
created at definition time.  If you use ALTER TABLE ... INHERIT
(attaching a table as a child after creation) then obviously the child
table cannot be modified to match its new parent; and pg_dump reproduces
the exact column ordering that the table originally had.  If you use
"CREATE TABLE ... INHERITS (parent)" then the child columns are reordered
*at that point* (creation time); the dump will, again, reproduce the
exact same definition.


I think failing to reproduce the exact same definition is a pg_dump bug
that should be fixed and backpatched to pg10.  It's just sheer luck that
nobody has complained of being bitten by it.

-- 
Álvaro Herrerahttp://www.twitter.com/alvherre




patch that explains Log-Shipping standby server major upgrades

2019-04-23 Thread Konstantin Evteev
Hello All,

Please find attached a patch that explains Log-Shipping standby server
major upgrades.
We agreed to do this in
https://www.postgresql.org/message-id/CAA4eK1%2Bo6ErVAh484VtE91wow1-uOysohSvb0TS52Ei76PzOKg%40mail.gmail.com

Thanks,
Konstantin Evteev
diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 0c4b16d32c..de0046c482 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -346,7 +346,7 @@ NET STOP postgresql-
 

 
-   
+   
 Prepare for standby server upgrades
 
 
@@ -361,6 +361,20 @@ NET STOP postgresql-
  replica in the postgresql.conf file on the
  new primary cluster.
 
+
+
+
+ If you use Log-Shipping standby servers (without streaming), the last
+ WAL file, in which the shutdown checkpoint record is written, won't be
+ archived.  To let the standby servers catch up, you need to copy that
+ last WAL file from the primary to the standby servers and wait until it
+ has been applied.  After that, the standby servers can issue a
+ restartpoint at the same location as in the stopped master.  As an
+ alternative, before  you can switch from Log-Shipping
+ to Streaming Replication.
+
+

 



Re: finding changed blocks using WAL scanning

2019-04-23 Thread Tomas Vondra

On Tue, Apr 23, 2019 at 10:09:39AM -0700, Andres Freund wrote:

Hi,

On 2019-04-23 19:01:29 +0200, Tomas Vondra wrote:

On Tue, Apr 23, 2019 at 09:34:54AM -0700, Andres Freund wrote:
> Hi,
>
> On 2019-04-23 18:07:40 +0200, Tomas Vondra wrote:
> > Well, the thing is that for prefetching to be possible you actually have
> > to be a bit behind. Otherwise you can't really look forward which blocks
> > will be needed, right?
> >
> > IMHO the main use case for prefetching is when there's a spike of activity
> > on the primary, making the standby to fall behind, and then hours takes
> > hours to catch up. I don't think the cases with just a couple of MBs of
> > lag are the issue prefetching is meant to improve (if it does, great).
>
> I'd be surprised if a good implementation didn't. Even just some smarter
> IO scheduling in the startup process could help a good bit. E.g. no need
> to sequentially read the first and then the second block for an update
> record, if you can issue both at the same time - just about every
> storage system these days can do a number of IO requests in parallel,
> and it nearly halves latency effects. And reading a few records (as in a
> few hundred bytes commonly) ahead, allows to do much more than that.
>

I don't disagree with that - prefetching certainly can improve utilization
of the storage system. The question is whether it can meaningfully improve
performance of the recovery process in cases when it does not lag.  And I
think it can't (perhaps with remote_apply being  an exception).


Well. I think a few dozen records behind doesn't really count as "lag",
and I think that's where it'd start to help (and for some record types
like updates it'd start to help even for single records). It'd convert
scenarios where we'd currently fall behind slowly into scenarios where
we can keep up - but where there's no meaningful lag while we keep up.
What's your argument for me being wrong?



I was not saying you are wrong. I think we actually agree on the main
points. My point is that prefetching is most valuable for cases when the
standby can't keep up and falls behind significantly - at which point we
have sufficient queue of blocks to prefetch. I don't care about the case
when the standby can keep up even without prefetching, because the metric
we need to optimize (i.e. lag) is close to 0 even without prefetching.


And even if we'd keep up without any prefetching, issuing requests in a
more efficient manner allows for more efficient concurrent use of the
storage system. It'll often effectively reduce the amount of random
iops.


Maybe, although the metric we (and users) care about the most is the
amount of lag. If the system keeps up even without prefetching, no one
will complain about I/O utilization.

When the lag is close to 0, the average throughput/IOPS/... is bound to be
the same in both cases, because it does not affect how fast the standby
receives WAL from the primary. Except that it's somewhat "spikier" with
prefetching, because we issue requests in bursts. Which may actually be a
bad thing.

Of course, maybe prefetching will make it much more efficient even in the
"no lag" case, and while it won't improve the recovery, it'll leave more
I/O bandwidth for the other processes (say, queries on hot standby).

So to be clear, I'm not against prefetching even in this case, but it's
not the primary reason why I think we need to do that.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Pathological performance when inserting many NULLs into a unique index

2019-04-23 Thread Peter Geoghegan
On Fri, Apr 19, 2019 at 6:34 PM Peter Geoghegan  wrote:
> Attached revision does it that way, specifically by adding a new field
> to the insertion scankey struct (BTScanInsertData).

Pushed.

-- 
Peter Geoghegan




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-23 Thread Andres Freund
Hi,

On 2019-04-23 13:31:25 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-04-23 15:46:17 +0530, Amit Kapila wrote:
> >> If we invalidate it only when there's no space on the page, then when
> >> should we set it back to available, because if we don't do that, then
> >> we might miss the space due to concurrent deletes.
> 
> > Well, deletes don't traditionally (i.e. with an actual FSM) mark free
> > space as available (for heap). I think RecordPageWithFreeSpace() should
> > issue an invalidation if there's no FSM, and the block goes from full to
> > empty (as there's no half-full IIUC).
> 
> Why wouldn't we implement this just as a mini four-entry FSM stored in
> the relcache, and update the entries according to the same rules we'd
> use for regular FSM entries?

I mean the big difference is that it's not shared memory. So there needs
to be some difference. My suggestion to handle that is to just issue an
invalidation when *increasing* the amount of space.

And sure, leaving that aside we could store one byte per block - it's
just not what the patch has done so far (or rather, it used one byte per
block, but only utilized one bit of it).  It's possible that'd come with
some overhead - I've not thought sufficiently about it: I assume we'd
still start out in each backend assuming each page is empty, and we'd
then rely on RelationGetBufferForTuple() to update that. What I wonder
is if we'd need to check if an on-disk FSM has been created every time
the space on a page is reduced?  I think not, if we use invalidations to
notify others of the existence of an on-disk FSM. There's a small race,
but that seems ok.

Greetings,

Andres Freund




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-23 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-23 15:46:17 +0530, Amit Kapila wrote:
>> If we invalidate it only when there's no space on the page, then when
>> should we set it back to available, because if we don't do that, then
>> we might miss the space due to concurrent deletes.

> Well, deletes don't traditionally (i.e. with an actual FSM) mark free
> space as available (for heap). I think RecordPageWithFreeSpace() should
> issue an invalidation if there's no FSM, and the block goes from full to
> empty (as there's no half-full IIUC).

Why wouldn't we implement this just as a mini four-entry FSM stored in
the relcache, and update the entries according to the same rules we'd
use for regular FSM entries?
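
For concreteness, a minimal sketch of what such a relcache-resident mini-map
might look like (all names here are invented for illustration, not taken from
the patch):

    #define LOCAL_FSM_NBLOCKS   4   /* i.e. HEAP_FSM_CREATION_THRESHOLD */

    typedef struct LocalFSM
    {
        BlockNumber nblocks;                    /* blocks currently tracked */
        uint8       avail[LOCAL_FSM_NBLOCKS];   /* FSM-style category per block */
    } LocalFSM;                                 /* hung off the relcache entry */

    /* return the first tracked block with at least the requested category */
    static BlockNumber
    local_fsm_search(LocalFSM *map, uint8 min_cat)
    {
        for (BlockNumber blk = 0; blk < map->nblocks; blk++)
        {
            if (map->avail[blk] >= min_cat)
                return blk;
        }
        return InvalidBlockNumber;
    }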

regards, tom lane




Re: Trouble with FETCH_COUNT and combined queries in psql

2019-04-23 Thread Fabien COELHO



Hello Tom,


Keep in mind that a large part of the reason why the \cset patch got
bounced was exactly that its detection of \; was impossibly ugly
and broken.  Don't expect another patch using the same logic to
get looked on more favorably.


Although I do not claim that my implementation was very good, I must admit 
that I'm not sure why there would be an issue if the lexer API is made 
aware of the position of embedded \;, if this information is useful for a 
feature.



I'm not really sure how far we should go to try to make this case "work".
To my mind, use of \; in this way represents an intentional defeat of
psql's algorithms for deciding where end-of-query is.


I do not understand: the lexer knows where all queries end, it just does
not say so currently?


If that ends up in behavior you don't like, ISTM that's your fault not 
psql's.


ISTM that the user is not responsible for non-orthogonal features provided 
by psql or pgbench (we implemented this great feature, but not in all 
cases).



Having said that, I did like the idea of maybe going over to
PQsetSingleRowMode instead of using an explicit cursor.  That
would represent a net decrease of cruftiness here, instead of
layering more cruft on top of what's already a mighty ugly hack.


Possibly, but, without having looked precisely at the implementation, I'm 
afraid that it would result in more messages. Maybe I'm wrong. Also, I'm 
unsure how returning a new PGresult for each row (as I understood the mode 
from the doc) would interact with \;, i.e. how to know whether the current 
query has changed. Maybe all this has a simple answer.
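
For reference, the API shape is roughly this (a bare-bones sketch, not psql
code). As I read the libpq docs, each row comes back as its own
PGRES_SINGLE_TUPLE result, and each statement of a multi-statement string is
terminated by a zero-row PGRES_TUPLES_OK, which is how a client can tell that
one query's result set has ended and the next one begins:

    #include <stdio.h>
    #include <libpq-fe.h>

    static void
    run_in_single_row_mode(PGconn *conn, const char *commands)
    {
        PGresult   *res;

        if (!PQsendQuery(conn, commands))
        {
            fprintf(stderr, "%s", PQerrorMessage(conn));
            return;
        }
        if (!PQsetSingleRowMode(conn))
            fprintf(stderr, "could not activate single-row mode\n");

        while ((res = PQgetResult(conn)) != NULL)
        {
            switch (PQresultStatus(res))
            {
                case PGRES_SINGLE_TUPLE:
                    /* one row of the current statement (assumes >= 1 column) */
                    printf("%s\n", PQgetvalue(res, 0, 0));
                    break;
                case PGRES_TUPLES_OK:
                    /* zero-row terminator: current statement is finished */
                    printf("-- end of one result set\n");
                    break;
                default:
                    fprintf(stderr, "%s", PQresultErrorMessage(res));
                    break;
            }
            PQclear(res);
        }
    }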



However ... that'd require using PQsendQuery, which means that the
case at hand with \; would result in a server error rather than
surprising client-side behavior.  Is that an acceptable outcome?


I've sent a patch to replace the existing PQexec by PQsendQuery for a not 
directly related feature, see https://commitfest.postgresql.org/23/2096/


--
Fabien.




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-23 Thread Andres Freund
Hi,

On 2019-04-23 15:46:17 +0530, Amit Kapila wrote:
> On Mon, Apr 22, 2019 at 10:34 PM Andres Freund  wrote:
> > On 2019-04-22 18:49:44 +0530, Amit Kapila wrote:
> > >  /*
> > > @@ -1132,9 +1110,6 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks)
> > >   BlockNumber blkno,
> > >   cached_target_block;
> > >
> > > - /* The local map must not be set already. */
> > > - Assert(!FSM_LOCAL_MAP_EXISTS);
> > > -
> > >   /*
> > >* Starting at the current last block in the relation and working
> > >* backwards, mark alternating blocks as available.
> > > @@ -1142,7 +1117,7 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks)
> > >   blkno = cur_nblocks - 1;
> > >   while (true)
> > >   {
> > > - fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
> > > + rel->fsm_local_map->map[blkno] = FSM_LOCAL_AVAIL;
> > >   if (blkno >= 2)
> > >   blkno -= 2;
> > >   else
> > > @@ -1150,13 +1125,13 @@ fsm_local_set(Relation rel, BlockNumber 
> > > cur_nblocks)
> > >   }
> > >
> > >   /* Cache the number of blocks. */
> > > - fsm_local_map.nblocks = cur_nblocks;
> > > + rel->fsm_local_map->nblocks = cur_nblocks;
> > >
> > >   /* Set the status of the cached target block to 'unavailable'. */
> > >   cached_target_block = RelationGetTargetBlock(rel);
> > >   if (cached_target_block != InvalidBlockNumber &&
> > >   cached_target_block < cur_nblocks)
> > > - fsm_local_map.map[cached_target_block] = 
> > > FSM_LOCAL_NOT_AVAIL;
> > > + rel->fsm_local_map->map[cached_target_block] = 
> > > FSM_LOCAL_NOT_AVAIL;
> > >  }
> >
> > I think there shouldn't be any need for this anymore. After this change
> > we shouldn't invalidate the map until there's no space on it - thereby
> > addressing the cost overhead, and greatly reducing the likelihood that
> > the local FSM can lead to increased bloat.

> If we invalidate it only when there's no space on the page, then when
> should we set it back to available, because if we don't do that, then
> we might miss the space due to concurrent deletes.

Well, deletes don't traditionally (i.e. with an actual FSM) mark free
space as available (for heap). I think RecordPageWithFreeSpace() should
issue an invalidation if there's no FSM, and the block goes from full to
empty (as there's no half-full IIUC).
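
Roughly this shape, as a sketch of the idea only (fsm_exists() and the
threshold are hypothetical stand-ins, not actual code):

    void
    RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
    {
        if (!fsm_exists(rel))       /* hypothetical "no on-disk FSM" test */
        {
            /*
             * The page went from full back to usable: broadcast a relcache
             * invalidation so other backends rebuild their local map.
             */
            if (spaceAvail >= some_threshold)   /* "full -> empty" only */
                CacheInvalidateRelcache(rel);
            return;
        }

        /* ... normal on-disk FSM update path ... */
    }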

> > >  /*
> > > @@ -1168,18 +1143,18 @@ fsm_local_set(Relation rel, BlockNumber 
> > > cur_nblocks)
> > >   * This function is used when there is no FSM.
> > >   */
> > >  static BlockNumber
> > > -fsm_local_search(void)
> > > +fsm_local_search(Relation rel)
> > >  {
> > >   BlockNumber target_block;
> > >
> > >   /* Local map must be set by now. */
> > > - Assert(FSM_LOCAL_MAP_EXISTS);
> > > + Assert(FSM_LOCAL_MAP_EXISTS(rel));
> > >
> > > - target_block = fsm_local_map.nblocks;
> > > + target_block = rel->fsm_local_map->nblocks;
> > >   do
> > >   {
> > >   target_block--;
> > > - if (fsm_local_map.map[target_block] == FSM_LOCAL_AVAIL)
> > > + if (rel->fsm_local_map->map[target_block] == 
> > > FSM_LOCAL_AVAIL)
> > >   return target_block;
> > >   } while (target_block > 0);
> > >
> > > @@ -1189,7 +1164,22 @@ fsm_local_search(void)
> > >* first, which would otherwise lead to the same conclusion again 
> > > and
> > >* again.
> > >*/
> > > - FSMClearLocalMap();
> > > + fsm_clear_local_map(rel);
> >
> > I'm not sure I like this. My inclination would be that we should be able
> > to check the local fsm repeatedly even if there's no space in the
> > in-memory representation - otherwise the use of the local FSM increases
> > bloat.
> >
> 
> Do you mean to say that we always check all the pages (say 4)
> irrespective of their state in the local map?

I was wondering that, yes. But I think just issuing invalidations is the
right approach instead, see above.


> I think we should first try to see in this new scheme (a) when to set
> the map, (b) when to clear it, (c) how to use.  I have tried to
> summarize my thoughts about it, let me know what do you think about
> the same?
> 
> When to set the map.
> At the beginning (the first time relation is used in the backend), the
> map will be clear.  When the first time in the backend, we find that
> FSM doesn't exist and the number of blocks is lesser than
> HEAP_FSM_CREATION_THRESHOLD, we set the map for the total blocks that
> exist at that time and mark all or alternate blocks as available.

I think the alternate blocks scheme has to go. It's not defensible.


Greetings,

Andres Freund




Re: Symbol referencing errors

2019-04-23 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Apr-23, Laurenz Albe wrote:
>> Maybe you should open an oracle_fdw issue, but I don't know how
>> much I can help you, since this is the first time I have heard
>> of SmartOS.

> SmartOS is just the continuation of OpenSolaris, AFAIU.

Yeah.  You can see these same link warnings on castoroides and
protosciurus, though they're no longer building HEAD for
lack of C99-compliant system headers.

regards, tom lane




Re: Symbol referencing errors

2019-04-23 Thread Alvaro Herrera
On 2019-Apr-23, Laurenz Albe wrote:

> Maybe you should open an oracle_fdw issue, but I don't know how
> much I can help you, since this is the first time I have heard
> of SmartOS.

SmartOS is just the continuation of OpenSolaris, AFAIU.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: finding changed blocks using WAL scanning

2019-04-23 Thread Andres Freund
Hi,

On 2019-04-23 19:01:29 +0200, Tomas Vondra wrote:
> On Tue, Apr 23, 2019 at 09:34:54AM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2019-04-23 18:07:40 +0200, Tomas Vondra wrote:
> > > Well, the thing is that for prefetching to be possible you actually have
> > > to be a bit behind. Otherwise you can't really look forward which blocks
> > > will be needed, right?
> > > 
> > > IMHO the main use case for prefetching is when there's a spike of activity
> > > on the primary, making the standby fall behind, and then it takes
> > > hours to catch up. I don't think the cases with just a couple of MBs of
> > > lag are the issue prefetching is meant to improve (if it does, great).
> > 
> > I'd be surprised if a good implementation didn't. Even just some smarter
> > IO scheduling in the startup process could help a good bit. E.g. no need
> > to sequentially read the first and then the second block for an update
> > record, if you can issue both at the same time - just about every
> > storage system these days can do a number of IO requests in parallel,
> > and it nearly halves latency effects. And reading a few records (as in a
> > few hundred bytes commonly) ahead, allows to do much more than that.
> > 
> 
> I don't disagree with that - prefetching certainly can improve utilization
> of the storage system. The question is whether it can meaningfully improve
> performance of the recovery process in cases when it does not lag.  And I
> think it can't (perhaps with remote_apply being  an exception).

Well. I think a few dozen records behind doesn't really count as "lag",
and I think that's where it'd start to help (and for some record types
like updates it'd start to help even for single records). It'd convert
scenarios where we'd currently fall behind slowly into scenarios where
we can keep up - but where there's no meaningful lag while we keep up.
What's your argument for me being wrong?

And even if we'd keep up without any prefetching, issuing requests in a
more efficient manner allows for more efficient concurrent use of the
storage system. It'll often effectively reduce the amount of random
iops.

Greetings,

Andres Freund




Re: Regression test PANICs with master-standby setup on same machine

2019-04-23 Thread Andres Freund
Hi,

On 2019-04-23 16:08:18 +0900, Michael Paquier wrote:
> On Mon, Apr 22, 2019 at 11:00:03PM -0700, Andres Freund wrote:
> > FWIW, I think the right fix for this is to simply drop the requirement
> > that tablespace paths need to be absolute. It's not buying us anything,
> > it's just making things more complicated. We should just do a simple
> > check against the tablespace being inside PGDATA, and leave it at
> > that. Yes, that can be tricked, but so can the current system.
> 
> convert_and_check_filename() checks after that already, mostly.  For
> TAP tests I am not sure that this would help much though as all the
> nodes of a given test use the same root path for their data folders,
> so you cannot just use "../hoge/" as location.

I don't see the problem here.  Putting the primary and standby PGDATAs
into a subdirectory that also can contain a relatively referenced
tablespace seems trivial?

> I'm not We already generate a warning when a tablespace is in a data
> folder, as this causes issues with recursion lookups of base backups.
> What do you mean in this case?  Forbidding the behavior?  -- Michael

I mostly am talking about replacing

Oid
CreateTableSpace(CreateTableSpaceStmt *stmt)
{
...
/*
 * Allowing relative paths seems risky
 *
 * this also helps us ensure that location is not empty or whitespace
 */
if (!is_absolute_path(location))
ereport(ERROR,
(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
 errmsg("tablespace location must be an 
absolute path")));

with a check that forces relative paths to be outside of PGDATA (barring
symlinks). As far as I can tell convert_and_check_filename() would check
just about the opposite.
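
For illustration, a rough sketch of such a check (assuming a relative location
would be interpreted relative to the data directory; the exact coding and the
error wording here are just placeholders):

    char       *resolved;

    if (is_absolute_path(location))
        resolved = pstrdup(location);
    else
        resolved = psprintf("%s/%s", DataDir, location);
    canonicalize_path(resolved);

    if (path_is_prefix_of_path(DataDir, resolved))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
                 errmsg("tablespace location must not be inside the data directory")));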

Greetings,

Andres Freund




Re: finding changed blocks using WAL scanning

2019-04-23 Thread Tomas Vondra

On Tue, Apr 23, 2019 at 09:34:54AM -0700, Andres Freund wrote:

Hi,

On 2019-04-23 18:07:40 +0200, Tomas Vondra wrote:

Well, the thing is that for prefetching to be possible you actually have
to be a bit behind. Otherwise you can't really look forward which blocks
will be needed, right?

IMHO the main use case for prefetching is when there's a spike of activity
on the primary, making the standby fall behind, and then it takes
hours to catch up. I don't think the cases with just a couple of MBs of
lag are the issue prefetching is meant to improve (if it does, great).


I'd be surprised if a good implementation didn't. Even just some smarter
IO scheduling in the startup process could help a good bit. E.g. no need
to sequentially read the first and then the second block for an update
record, if you can issue both at the same time - just about every
storage system these days can do a number of IO requests in parallel,
and it nearly halves latency effects. And reading a few records (as in a
few hundred bytes commonly) ahead, allows to do much more than that.



I don't disagree with that - prefetching certainly can improve utilization
of the storage system. The question is whether it can meaningfully improve
performance of the recovery process in cases when it does not lag.  And I
think it can't (perhaps with remote_apply being  an exception).

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: finding changed blocks using WAL scanning

2019-04-23 Thread Andres Freund
Hi,

On 2019-04-23 18:07:40 +0200, Tomas Vondra wrote:
> Well, the thing is that for prefetching to be possible you actually have
> to be a bit behind. Otherwise you can't really look forward which blocks
> will be needed, right?
> 
> IMHO the main use case for prefetching is when there's a spike of activity
> on the primary, making the standby fall behind, and then it takes
> hours to catch up. I don't think the cases with just a couple of MBs of
> lag are the issue prefetching is meant to improve (if it does, great).

I'd be surprised if a good implementation didn't. Even just some smarter
IO scheduling in the startup process could help a good bit. E.g. no need
to sequentially read the first and then the second block for an update
record, if you can issue both at the same time - just about every
storage system these days can do a number of IO requests in parallel,
and it nearly halves latency effects. And reading a few records (as in a
few hundred bytes commonly) ahead, allows to do much more than that.
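
As a trivial illustration of the "issue both at the same time" part, using
plain posix_fadvise() (where available) as a stand-in for whatever mechanism
the startup process would actually use, and ignoring relation segmentation
for brevity:

    #include "postgres.h"
    #include <fcntl.h>

    static void
    prefetch_update_blocks(int fd_old, BlockNumber old_blk,
                           int fd_new, BlockNumber new_blk)
    {
        /* hint the kernel about both heap blocks before redo reads them,
         * so the two I/Os can overlap instead of running back to back */
        (void) posix_fadvise(fd_old, (off_t) old_blk * BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
        if (fd_new != fd_old || new_blk != old_blk)
            (void) posix_fadvise(fd_new, (off_t) new_blk * BLCKSZ, BLCKSZ,
                                 POSIX_FADV_WILLNEED);
    }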

Greetings,

Andres Freund




Re: finding changed blocks using WAL scanning

2019-04-23 Thread Tomas Vondra

On Tue, Apr 23, 2019 at 11:43:05AM -0400, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:

On Tue, Apr 23, 2019 at 10:22:46AM -0400, Stephen Frost wrote:
>* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>>On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
>>>On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost  wrote:
> Oh.  Well, I already explained my algorithm for doing that upthread,
> which I believe would be quite cheap.
>
> 1. When you generate the .modblock files, stick all the block
> references into a buffer.  qsort().  Dedup.  Write out in sorted
> order.

Having all of the block references in a sorted order does seem like it
would help, but would also make those potentially quite a bit larger
than necessary (I had some thoughts about making them smaller elsewhere
in this discussion).  That might be worth it though.  I suppose it might
also be possible to line up the bitmaps suggested elsewhere to do
essentially a BitmapOr of them to identify the blocks changed (while
effectively de-duping at the same time).
>>>
>>>I don't see why this would make them bigger than necessary.  If you
>>>sort by relfilenode/fork/blocknumber and dedup, then references to
>>>nearby blocks will be adjacent in the file.  You can then decide what
>>>format will represent that most efficiently on output.  Whether or not
>>>a bitmap is better idea than a list of block numbers or something else
>>>depends on what percentage of blocks are modified and how clustered
>>>they are.
>>
>>Not sure I understand correctly - do you suggest to deduplicate and sort
>>the data before writing them into the .modblock files? Because that the
>>the sorting would make this information mostly useless for the recovery
>>prefetching use case I mentioned elsewhere. For that to work we need
>>information about both the LSN and block, in the LSN order.
>
>I'm not sure I follow- why does the prefetching need to get the blocks
>in LSN order..?  Once the blocks that we know are going to change in the
>next segment have been identified, we could prefetch them all and have
>them ready for when replay gets to them.  I'm not sure that we
>specifically need to have them pre-fetched in the same order that the
>replay happens and it might even be better to fetch them in an order
>that's as sequential as possible to get them in as quickly as possible.

That means we'd have to prefetch all blocks for the whole WAL segment,
which is pretty useless, IMO. A single INSERT (especially for indexes) is
often just ~100B, so a single 16MB segment can fit ~160k of them. Surely
we don't want to prefetch all of that at once? And it's even worse for
larger WAL segments, which are likely to get more common now that it's an
initdb option.


Ah, yeah, I had been thinking about FPIs and blocks and figuring that
16MB wasn't that bad but you're certainly right that we could end up
touching a lot more blocks in a given WAL segment.


I'm pretty sure the prefetching needs to be more like "prefetch the next
1024 blocks we'll need" or "prefetch blocks from the next X megabytes of
WAL". That doesn't mean we can't do some additional optimization (like
reordering them a bit), but there's a point where it gets detrimental
because the kernel will just evict some of the prefetched blocks before we
actually access them. And the larger amount of blocks you prefetch the
more likely that is.


Since we're prefetching, maybe we shouldn't be just pulling the blocks
into the filesystem cache but instead into something we have more
control over..?



Well, yeah, and there was a discussion about that in the prefetching
thread IIRC. But in that case it's probably even more important to
prefetch only a fairly small chunk of blocks (instead of the whole WAL
segment worth of blocks), because shared buffers are usually much smaller
compared to page cache. So we'd probably want some sort of small ring
buffer there, so the LSN information is even more important.


>>So if we want to allow that use case to leverage this infrastructure, we
>>need to write the .modfiles kinda "raw" and do this processing in some
>>later step.
>
>If we really need the LSN info for the blocks, then we could still
>de-dup, picking the 'first modified in this segment at LSN X', or keep
>both first and last, or I suppose every LSN if we really want, and then
>have that information included with the other information about the
>block.  Downstream clients could then sort based on the LSN info if they
>want to have a list of blocks in sorted-by-LSN-order.

Possibly. I don't think keeping just the first block occurrence is enough,
particularly for large WAL segment sizes, but I agree we can reduce the
stuff a bit (say, ignore references that are less than 1MB apart or so).
We just can't remove all the LSN information entirely.


Yeah, it depends on how much memory we want to use for prefetching and
how large the WAL 

Re: doc: datatype-table and xfunc-c-type-table

2019-04-23 Thread David Fetter
On Sun, Apr 14, 2019 at 02:49:57AM -0400, Chapman Flack wrote:
> Hi,
> 
> Both tables in $subject (in datatype.sgml and xfunc.sgml, respectively)
> contain similar information (though the xfunc one mentions C structs and
> header files, and the datatype one does not, but has a description column)
> and seem similarly out-of-date with respect to the currently supported
> types ... though not identically out-of-date; they have different numbers
> of rows, and different types that are missing.
> 
> How crazy an idea would it be to have include/catalog/pg_type.dat
> augmented with description, ctypename, and cheader fields, and let
> both tables be generated, with their respective columns?

Not at all. Although literate programming didn't catch on, having a
single point of truth is generally good practice.

There are almost certainly other parts of the documentation that
should also be generated from the source code, but that's a matter for
separate threads for the cases where that would make sense.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: finding changed blocks using WAL scanning

2019-04-23 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Tue, Apr 23, 2019 at 10:22:46AM -0400, Stephen Frost wrote:
> >* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> >>On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
> >>>On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost  wrote:
> > Oh.  Well, I already explained my algorithm for doing that upthread,
> > which I believe would be quite cheap.
> >
> > 1. When you generate the .modblock files, stick all the block
> > references into a buffer.  qsort().  Dedup.  Write out in sorted
> > order.
> 
> Having all of the block references in a sorted order does seem like it
> would help, but would also make those potentially quite a bit larger
> than necessary (I had some thoughts about making them smaller elsewhere
> in this discussion).  That might be worth it though.  I suppose it might
> also be possible to line up the bitmaps suggested elsewhere to do
> essentially a BitmapOr of them to identify the blocks changed (while
> effectively de-duping at the same time).
> >>>
> >>>I don't see why this would make them bigger than necessary.  If you
> >>>sort by relfilenode/fork/blocknumber and dedup, then references to
> >>>nearby blocks will be adjacent in the file.  You can then decide what
> >>>format will represent that most efficiently on output.  Whether or not
> >>>a bitmap is better idea than a list of block numbers or something else
> >>>depends on what percentage of blocks are modified and how clustered
> >>>they are.
> >>
> >>Not sure I understand correctly - do you suggest to deduplicate and sort
> >>the data before writing them into the .modblock files? Because that the
> >>the sorting would make this information mostly useless for the recovery
> >>prefetching use case I mentioned elsewhere. For that to work we need
> >>information about both the LSN and block, in the LSN order.
> >
> >I'm not sure I follow- why does the prefetching need to get the blocks
> >in LSN order..?  Once the blocks that we know are going to change in the
> >next segment have been identified, we could prefetch them all and have
> >them ready for when replay gets to them.  I'm not sure that we
> >specifically need to have them pre-fetched in the same order that the
> >replay happens and it might even be better to fetch them in an order
> >that's as sequential as possible to get them in as quickly as possible.
> 
> That means we'd have to prefetch all blocks for the whole WAL segment,
> which is pretty useless, IMO. A single INSERT (especially for indexes) is
> often just ~100B, so a single 16MB segment can fit ~160k of them. Surely
> we don't want to prefetch all of that at once? And it's even worse for
> larger WAL segments, which are likely to get more common now that it's an
> initdb option.

Ah, yeah, I had been thinking about FPIs and blocks and figuring that
16MB wasn't that bad but you're certainly right that we could end up
touching a lot more blocks in a given WAL segment.

> I'm pretty sure the prefetching needs to be more like "prefetch the next
> 1024 blocks we'll need" or "prefetch blocks from the next X megabytes of
> WAL". That doesn't mean we can't do some additional optimization (like
> reordering them a bit), but there's a point where it gets detrimental
> because the kernel will just evict some of the prefetched blocks before we
> actually access them. And the larger amount of blocks you prefetch the
> more likely that is.

Since we're prefetching, maybe we shouldn't be just pulling the blocks
into the filesystem cache but instead into something we have more
control over..?

> >>So if we want to allow that use case to leverage this infrastructure, we
> >>need to write the .modfiles kinda "raw" and do this processing in some
> >>later step.
> >
> >If we really need the LSN info for the blocks, then we could still
> >de-dup, picking the 'first modified in this segment at LSN X', or keep
> >both first and last, or I suppose every LSN if we really want, and then
> >have that information included with the other information about the
> >block.  Downstream clients could then sort based on the LSN info if they
> >want to have a list of blocks in sorted-by-LSN-order.
> 
> Possibly. I don't think keeping just the first block occurrence is enough,
> particularly for large WAL segment sizes, but I agree we can reduce the
> stuff a bit (say, ignore references that are less than 1MB apart or so).
> We just can't remove all the LSN information entirely.

Yeah, it depends on how much memory we want to use for prefetching and
how large the WAL segments are.

> >>Now, maybe the incremental backup use case is so much more important the
> >>right thing to do is ignore this other use case, and I'm OK with that -
> >>as long as it's a conscious choice.
> >
> >I'd certainly like to have a way to prefetch, but I'm not entirely sure
> >that it makes sense to combine it with this, so while I sketched out

Re: finding changed blocks using WAL scanning

2019-04-23 Thread Tomas Vondra

On Tue, Apr 23, 2019 at 10:22:46AM -0400, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:

On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
>On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost  wrote:
>>> Oh.  Well, I already explained my algorithm for doing that upthread,
>>> which I believe would be quite cheap.
>>>
>>> 1. When you generate the .modblock files, stick all the block
>>> references into a buffer.  qsort().  Dedup.  Write out in sorted
>>> order.
>>
>>Having all of the block references in a sorted order does seem like it
>>would help, but would also make those potentially quite a bit larger
>>than necessary (I had some thoughts about making them smaller elsewhere
>>in this discussion).  That might be worth it though.  I suppose it might
>>also be possible to line up the bitmaps suggested elsewhere to do
>>essentially a BitmapOr of them to identify the blocks changed (while
>>effectively de-duping at the same time).
>
>I don't see why this would make them bigger than necessary.  If you
>sort by relfilenode/fork/blocknumber and dedup, then references to
>nearby blocks will be adjacent in the file.  You can then decide what
>format will represent that most efficiently on output.  Whether or not
>a bitmap is better idea than a list of block numbers or something else
>depends on what percentage of blocks are modified and how clustered
>they are.

Not sure I understand correctly - do you suggest to deduplicate and sort
the data before writing them into the .modblock files? Because that the
the sorting would make this information mostly useless for the recovery
prefetching use case I mentioned elsewhere. For that to work we need
information about both the LSN and block, in the LSN order.


I'm not sure I follow- why does the prefetching need to get the blocks
in LSN order..?  Once the blocks that we know are going to change in the
next segment have been identified, we could prefetch them all and have
them ready for when replay gets to them.  I'm not sure that we
specifically need to have them pre-fetched in the same order that the
replay happens and it might even be better to fetch them in an order
that's as sequential as possible to get them in as quickly as possible.



That means we'd have to prefetch all blocks for the whole WAL segment,
which is pretty useless, IMO. A single INSERT (especially for indexes) is
often just ~100B, so a single 16MB segment can fit ~160k of them. Surely
we don't want to prefetch all of that at once? And it's even worse for
larger WAL segments, which are likely to get more common now that it's an
initdb option.

I'm pretty sure the prefetching needs to be more like "prefetch the next
1024 blocks we'll need" or "prefetch blocks from the next X megabytes of
WAL". That doesn't mean we can't do some additional optimization (like
reordering them a bit), but there's a point where it gets detrimental
because the kernel will just evict some of the prefetched blocks before we
actually access them. And the larger amount of blocks you prefetch the
more likely that is.



So if we want to allow that use case to leverage this infrastructure, we
need to write the .modfiles kinda "raw" and do this processing in some
later step.


If we really need the LSN info for the blocks, then we could still
de-dup, picking the 'first modified in this segment at LSN X', or keep
both first and last, or I suppose every LSN if we really want, and then
have that information included with the other information about the
block.  Downstream clients could then sort based on the LSN info if they
want to have a list of blocks in sorted-by-LSN-order.



Possibly. I don't think keeping just the first block occurrence is enough,
particularly for large WAL segment sizes, but I agree we can reduce the
stuff a bit (say, ignore references that are less than 1MB apart or so).
We just can't remove all the LSN information entirely.


Now, maybe the incremental backup use case is so much more important the
right thing to do is ignore this other use case, and I'm OK with that -
as long as it's a conscious choice.


I'd certainly like to have a way to prefetch, but I'm not entirely sure
that it makes sense to combine it with this, so while I sketched out
some ideas about how to do that above, I don't want it to come across as
being a strong endorsement of the overall idea.

For pre-fetching purposes, for an async streaming replica, it seems like
the wal sender process could potentially just scan the WAL and have a
list of blocks ready to pass to the replica which are "this is what's
coming soon" or similar, rather than working with the modfiles at all.
Not sure if we'd always send that or if we wait for the replica to ask
for it.  Though for doing WAL replay from the archive, being able to ask
for the modfile first to do prefetching before replaying the WAL itself
could certainly be beneficial, so maybe it does make sense to have that
information there too..  still not 

Re: finding changed blocks using WAL scanning

2019-04-23 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Sat, Apr 20, 2019 at 04:21:52PM -0400, Robert Haas wrote:
> >On Sat, Apr 20, 2019 at 12:42 AM Stephen Frost  wrote:
> >>> Oh.  Well, I already explained my algorithm for doing that upthread,
> >>> which I believe would be quite cheap.
> >>>
> >>> 1. When you generate the .modblock files, stick all the block
> >>> references into a buffer.  qsort().  Dedup.  Write out in sorted
> >>> order.
> >>
> >>Having all of the block references in a sorted order does seem like it
> >>would help, but would also make those potentially quite a bit larger
> >>than necessary (I had some thoughts about making them smaller elsewhere
> >>in this discussion).  That might be worth it though.  I suppose it might
> >>also be possible to line up the bitmaps suggested elsewhere to do
> >>essentially a BitmapOr of them to identify the blocks changed (while
> >>effectively de-duping at the same time).
> >
> >I don't see why this would make them bigger than necessary.  If you
> >sort by relfilenode/fork/blocknumber and dedup, then references to
> >nearby blocks will be adjacent in the file.  You can then decide what
> >format will represent that most efficiently on output.  Whether or not
> >a bitmap is better idea than a list of block numbers or something else
> >depends on what percentage of blocks are modified and how clustered
> >they are.
> 
> Not sure I understand correctly - do you suggest to deduplicate and sort
> the data before writing them into the .modblock files? Because then the
> sorting would make this information mostly useless for the recovery
> prefetching use case I mentioned elsewhere. For that to work we need
> information about both the LSN and block, in the LSN order.

I'm not sure I follow- why does the prefetching need to get the blocks
in LSN order..?  Once the blocks that we know are going to change in the
next segment have been identified, we could prefetch them all and have
them ready for when replay gets to them.  I'm not sure that we
specifically need to have them pre-fetched in the same order that the
replay happens and it might even be better to fetch them in an order
that's as sequential as possible to get them in as quickly as possible.

> So if we want to allow that use case to leverage this infrastructure, we
> need to write the .modfiles kinda "raw" and do this processing in some
> later step.

If we really need the LSN info for the blocks, then we could still
de-dup, picking the 'first modified in this segment at LSN X', or keep
both first and last, or I suppose every LSN if we really want, and then
have that information included with the other information about the
block.  Downstream clients could then sort based on the LSN info if they
want to have a list of blocks in sorted-by-LSN-order.
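
To illustrate that kind of de-dup (a sketch with made-up types, not the
actual .modblock machinery): sorting so that the lowest LSN for each block
comes first lets a single pass keep exactly the "first modified in this
segment at LSN X" entry.

#include <stdint.h>
#include <stdlib.h>

/* Illustrative only -- not the actual .modblock record layout. */
typedef struct BlockRef
{
    uint32_t    relfilenode;
    uint32_t    forknum;
    uint32_t    blocknum;
    uint64_t    lsn;            /* LSN of this modification */
} BlockRef;

static int
blockref_cmp(const void *a, const void *b)
{
    const BlockRef *x = a;
    const BlockRef *y = b;

    if (x->relfilenode != y->relfilenode)
        return x->relfilenode < y->relfilenode ? -1 : 1;
    if (x->forknum != y->forknum)
        return x->forknum < y->forknum ? -1 : 1;
    if (x->blocknum != y->blocknum)
        return x->blocknum < y->blocknum ? -1 : 1;
    /* same block: earliest LSN first, so the kept entry is the first one */
    if (x->lsn != y->lsn)
        return x->lsn < y->lsn ? -1 : 1;
    return 0;
}

/* Sort and keep one entry per block, carrying its first-modified LSN. */
static size_t
dedup_block_refs(BlockRef *refs, size_t n)
{
    size_t      keep = 0;

    if (n == 0)
        return 0;
    qsort(refs, n, sizeof(BlockRef), blockref_cmp);
    for (size_t i = 1; i < n; i++)
    {
        if (refs[i].relfilenode != refs[keep].relfilenode ||
            refs[i].forknum != refs[keep].forknum ||
            refs[i].blocknum != refs[keep].blocknum)
            refs[++keep] = refs[i];
    }
    return keep + 1;
}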

> Now, maybe the incremental backup use case is so much more important the
> right thing to do is ignore this other use case, and I'm OK with that -
> as long as it's a conscious choice.

I'd certainly like to have a way to prefetch, but I'm not entirely sure
that it makes sense to combine it with this, so while I sketched out
some ideas about how to do that above, I don't want it to come across as
being a strong endorsement of the overall idea.

For pre-fetching purposes, for an async streaming replica, it seems like
the wal sender process could potentially just scan the WAL and have a
list of blocks ready to pass to the replica which are "this is what's
coming soon" or similar, rather than working with the modfiles at all.
Not sure if we'd always send that or if we wait for the replica to ask
for it.  Though for doing WAL replay from the archive, being able to ask
for the modfile first to do prefetching before replaying the WAL itself
could certainly be beneficial, so maybe it does make sense to have that
information there too..  still not sure we really need it in LSN order
or that we need to prefetch in LSN order though.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: Trouble with FETCH_COUNT and combined queries in psql

2019-04-23 Thread Tom Lane
"Daniel Verite"  writes:
>   Fabien COELHO wrote:
>> I added some stuff to extract embedded "\;" for pgbench "\cset", which has 
>> been removed though, but it is easy to add back a detection of "\;", and 
>> also to detect select. If the position of the last select is known, the 
>> cursor can be declared in the right place, which would also solve the 
>> problem.

> Thanks, I'll extract the necessary bits from your patch.
> I don't plan to go as far as injecting a DECLARE CURSOR inside
> the query, but rather just forbid the use of the cursor in
> the combined-queries case.

Keep in mind that a large part of the reason why the \cset patch got
bounced was exactly that its detection of \; was impossibly ugly
and broken.  Don't expect another patch using the same logic to
get looked on more favorably.

I'm not really sure how far we should go to try to make this case "work".
To my mind, use of \; in this way represents an intentional defeat of
psql's algorithms for deciding where end-of-query is.  If that ends up
in behavior you don't like, ISTM that's your fault not psql's.

Having said that, I did like the idea of maybe going over to
PQsetSingleRowMode instead of using an explicit cursor.  That
would represent a net decrease of cruftiness here, instead of
layering more cruft on top of what's already a mighty ugly hack.

However ... that'd require using PQsendQuery, which means that the
case at hand with \; would result in a server error rather than
surprising client-side behavior.  Is that an acceptable outcome?
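
For reference, the PQsetSingleRowMode() flow mentioned above looks roughly
like this on the client side (a sketch with connection setup and most error
handling omitted; this is not psql's actual code):

#include <stdio.h>
#include <libpq-fe.h>

/* Fetch rows one at a time without a server-side cursor. */
static int
fetch_in_single_row_mode(PGconn *conn, const char *query)
{
    PGresult   *res;
    int         ok = 1;

    if (!PQsendQuery(conn, query))
        return 0;
    /* Must be called right after PQsendQuery, before any PQgetResult. */
    if (!PQsetSingleRowMode(conn))
        ok = 0;

    while ((res = PQgetResult(conn)) != NULL)
    {
        ExecStatusType st = PQresultStatus(res);

        if (st == PGRES_SINGLE_TUPLE)
            printf("%s\n", PQgetvalue(res, 0, 0));  /* one row at a time */
        else if (st != PGRES_TUPLES_OK)
        {
            fprintf(stderr, "%s", PQerrorMessage(conn));
            ok = 0;
        }
        PQclear(res);
    }
    return ok;
}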

regards, tom lane




Re: New vacuum option to do only freezing

2019-04-23 Thread Masahiko Sawada
On Thu, Apr 18, 2019 at 3:12 PM Masahiko Sawada  wrote:
>
> On Thu, Apr 18, 2019 at 4:20 AM Peter Geoghegan  wrote:
> >
> > On Wed, Apr 17, 2019 at 10:46 AM Tom Lane  wrote:
> > > Yeah, if we wanted to expose these complications more directly, we
> > > could think about adding or changing the main counters.  I was wondering
> > > about whether we should have it report "x bytes and y line pointers
> > > freed", rather than counting tuples per se.
> >
>
> It looks good idea to me.
>
> > I like that idea, but I'm pretty sure that there are very few users
> > that are aware of these distinctions at all. It would be a good idea
> > to clearly document them.
>
> I completely agreed. I'm sure that only a few users would take the action
> of enabling index cleanup when the report says there are many dead
> line pointers in the table.
>
> It brought me another idea of reporting something like "x bytes
> freed, y bytes can be freed after index cleanup". That is, we report
> how many bytes (including tuples and line pointers) we freed and how
> many bytes of line pointers can be freed after index cleanup. While
> index cleanup is enabled, the latter value should always be 0. If the
> latter value gets large, users can become aware of the necessity of
> doing index cleanup.
>

Attached is a draft version of the patch based on the discussion so far.
This patch fixes two issues: the assertion error topminnow reported
and the contents of the vacuum logs.

For the first issue, I've changed lazy_scan_heap so that it counts the
tuples that became dead after HOT pruning in nkeep when index cleanup
is disabled. As per discussions so far, it would be no problem to try
to freeze tuples that ought to have been deleted.

For the second issue, I've changed lazy vacuum so that it reports both
the number of kilobytes we freed and the number of kilobytes that can be
freed after index cleanup. The former value includes the size of not
only heap tuples but also line pointers. That is, when a normal tuple
has been marked as either dead or redirected we count only the size of
the heap tuple, and when it has been marked as unused we count the size
of both the heap tuple and the line pointer. Similarly, when either a
redirect line pointer or a dead line pointer becomes unused we count
only the size of the line pointer. The latter value we report can be
non-zero only when index cleanup is disabled; it counts the bytes of
dead line pointers we left behind. The advantage of this change is that
users can see both how many bytes we freed and how many bytes we left
due to skipping index cleanup, and the latter value tells them when
index cleanup becomes necessary.
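
The accounting rules described above boil down to something like this
(an illustrative sketch, not the patch's actual code):

#include <stddef.h>

/* What lazy vacuum did to a heap item (illustrative names only). */
typedef enum MarkKind
{
    MARKED_DEAD,            /* tuple gone, dead line pointer remains */
    MARKED_REDIRECTED,      /* tuple gone, redirect line pointer remains */
    MARKED_UNUSED           /* both tuple and line pointer reclaimed */
} MarkKind;

/* Bytes counted as "freed" for one heap item, per the rules above. */
static size_t
bytes_freed(size_t tuple_len, size_t itemid_len, MarkKind kind)
{
    switch (kind)
    {
        case MARKED_DEAD:
        case MARKED_REDIRECTED:
            return tuple_len;               /* line pointer stays behind */
        case MARKED_UNUSED:
            return tuple_len + itemid_len;  /* tuple and line pointer */
    }
    return 0;
}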

Also with this patch, we count only tuples that have been marked as
unused as deleted tuples; the action of replacing a dead root tuple
with a dead line pointer doesn't count for anything.  That is closer
to the meaning of "deleted tuples" and causes less confusion. We do
the counting when actually marking rather than when recording, because
we could record dead tuples and then forget them.

Here is a sample of the report from VACUUM VERBOSE. I used the
following script; the patch changes the vacuum report as shown below.

* Script
DROP TABLE IF EXISTS test;
CREATE TABLE test (c int primary key, d int);
INSERT INTO test SELECT * FROM generate_series(1,10000);
DELETE FROM test WHERE c < 2000;
VACUUM (INDEX_CLEANUP FALSE, VERBOSE) test;
VACUUM (INDEX_CLEANUP TRUE, VERBOSE) test;

* HEAD (when index cleanup is disabled)
INFO:  vacuuming "public.test"
INFO:  "test": found 1999 removable, 8001 nonremovable row versions in
45 out of 45 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 504
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
0 tuples and 1999 item identifiers are left as dead.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

* Patched (when index cleanup is disabled)
INFO:  vacuuming "public.test"
INFO:  "test": 55 kB freed, 7996 bytes can be freed after index
cleanup, table size: 360 kB
DETAIL:  found 8001 nonremovable row versions in 45 out of 45 pages.
0 dead row versions cannot be removed yet, oldest xmin: 1660
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

* HEAD (when index cleanup is enabled)
INFO:  vacuuming "public.test"
INFO:  scanned index "test_pkey" to remove 1999 row versions
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  "test": removed 1999 row versions in 9 pages
DETAIL:  CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO:  index "test_pkey" now contains 8001 row versions in 30 pages
DETAIL:  1999 index row versions were removed.
5 index pages have been deleted, 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  "test": found 0 removable, 91 nonremovable row versions in 10
out of 45 pages
DETAIL:  0 dead row versions cannot 

Re: block-level incremental backup

2019-04-23 Thread Anastasia Lubennikova

22.04.2019 2:02, Robert Haas wrote:

I think we're getting closer to a meeting of the minds here, but I
don't think it's intrinsically necessary to rewrite the whole method
of operation of pg_basebackup to implement incremental backup in a
sensible way.  One could instead just do a straightforward extension
to the existing BASE_BACKUP command to enable incremental backup.
Then, to enable parallel full backup and all sorts of out-of-core
hacking, one could expand the command language to allow tools to
access individual steps: START_BACKUP, SEND_FILE_LIST,
SEND_FILE_CONTENTS, STOP_BACKUP, or whatever.  The second thing makes
for an appealing project, but I do not think there is a technical
reason why it has to be done first.  Or for that matter why it has to
be done second.  As I keep saying, incremental backup and full backup
are separate projects and I believe it's completely reasonable for
whoever is doing the work to decide on the order in which they would
like to do the work.

Having said that, I'm curious what people other than Stephen (and
other pgbackrest hackers) think about the relative value of parallel
backup vs. incremental backup.  Stephen appears quite convinced that
parallel backup is full of win and incremental backup is a bit of a
yawn by comparison, and while I certainly would not want to discount
the value of his experience in this area, it sometimes happens on this
mailing list that [ drum roll please ] not everybody agrees about
everything.  So, what do other people think?


Personally, I believe that incremental backups are more useful to implement
first, since they benefit both backup speed and the space taken by a backup.
Frankly speaking, I'm a bit surprised that the discussion of parallel backups
took up so much of this thread.
Of course, we must keep it in mind while designing the API, to avoid
introducing any architectural obstacles, but any further discussion of
parallelism is a subject for another thread.


I understand Stephen's concerns about the difficulties of incremental backup
management.
Even with the assumption that the user is ready to manage backup chains,
retention, and other stuff, we must consider the format of backup metadata
that will allow us to perform some primitive commands:

1) Tell whether this backup is full or incremental.

2) Tell which backup is the parent of this incremental backup.
Probably, we can limit it to just returning "start_lsn", which can later be
compared to the "stop_lsn" of the parent backup.

3) Take an incremental backup based on this backup.
Here we must help a backup manager retrieve the LSN to pass to
pg_basebackup.

4) Restore an incremental backup into a directory (on top of an already
restored full backup).
One may use it to perform a "merge" or "restore" of the incremental backup,
depending on the destination directory.
I wonder if it is possible to integrate this into any existing tool, or
whether we end up with something like pg_basebackup/pg_baserestore, as in
the case of pg_dump/pg_restore.

Have you designed these? I can only recall "pg_combinebackup" from the very
first message in this thread, which looks more like a sketch to explain the
idea than a thought-out feature design. I also found the page
https://wiki.postgresql.org/wiki/Incremental_backup, which raises the same
questions.
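
Just as a strawman for that discussion (nothing like this exists in core
today, and the field names are made up), the per-backup metadata needed to
answer points 1-3 above could be as small as:

#include <stdbool.h>
#include <stdint.h>

/*
 * Strawman only: minimal per-backup metadata.  "Is this full or
 * incremental?" and "what is its parent?" need little more than the
 * backup's LSN range.
 */
typedef struct BackupMetadata
{
    bool        is_incremental;     /* point 1 */
    uint64_t    start_lsn;          /* point 2: matches the parent's stop_lsn */
    uint64_t    stop_lsn;           /* point 3: the LSN to hand to the next
                                     * incremental pg_basebackup */
    char        parent_label[64];   /* optional explicit parent reference */
} BackupMetadata;
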
I'm volunteering to write a draft patch or, more likely, a set of patches,
which will allow us to discuss the subject in more detail.
And to do that I wish we could agree on the API and data format (at least
broadly).

Looking forward to hearing your thoughts.


As I see it, ideally the backup management tools should concentrate more on
managing multiple backups, while all the logic of taking a single backup
(of any kind) should be integrated into the core. That means any out-of-core
client won't have to walk the PGDATA directory and carry all the
Postgres-specific knowledge of data files consisting of blocks with headers
and LSNs and so on. It simply requests data and gets it.
Understandably, this won't be implemented in one go, and quite possibly it
is not fully reachable.
Still, it would be great to do our best to provide such tools (both existing
and future) with conveniently formatted data and an API to get it.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: pg_dump is broken for partition tablespaces

2019-04-23 Thread David Rowley
On Tue, 23 Apr 2019 at 18:18, Amit Langote
 wrote:
>
> If partitions needed a
> map in the old database, this patch means that they will *continue* to
> need it in the new database.

That's incorrect.  My point was about dropped columns being removed
after a dump / reload.  Only binary upgrade mode preserves
pg_attribute entries for dropped columns. Normal mode does not, so the
maps won't be needed after the reload if they were previously only
needed due to dropped columns.  This is the case both with and without
the pg_dump changes I proposed.  The case the patch does change is if
the columns were actually out of order, which I saw as an unlikely
thing to happen in the real world.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Regression test PANICs with master-standby setup on same machine

2019-04-23 Thread Kyotaro HORIGUCHI
At Tue, 23 Apr 2019 19:00:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190423.190054.262966274.horiguchi.kyot...@lab.ntt.co.jp>
> > For TAP tests, we can point to the generated temporary directory with
> > "../../".

Wrong. Generating a tmpdir (how?) in "../" (that is, in the node
directory) would work.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-23 Thread Amit Kapila
On Mon, Apr 22, 2019 at 10:34 PM Andres Freund  wrote:
> On 2019-04-22 18:49:44 +0530, Amit Kapila wrote:
> >  /*
> > @@ -1132,9 +1110,6 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks)
> >   BlockNumber blkno,
> >   cached_target_block;
> >
> > - /* The local map must not be set already. */
> > - Assert(!FSM_LOCAL_MAP_EXISTS);
> > -
> >   /*
> >* Starting at the current last block in the relation and working
> >* backwards, mark alternating blocks as available.
> > @@ -1142,7 +1117,7 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks)
> >   blkno = cur_nblocks - 1;
> >   while (true)
> >   {
> > - fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
> > + rel->fsm_local_map->map[blkno] = FSM_LOCAL_AVAIL;
> >   if (blkno >= 2)
> >   blkno -= 2;
> >   else
> > @@ -1150,13 +1125,13 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks)
> >   }
> >
> >   /* Cache the number of blocks. */
> > - fsm_local_map.nblocks = cur_nblocks;
> > + rel->fsm_local_map->nblocks = cur_nblocks;
> >
> >   /* Set the status of the cached target block to 'unavailable'. */
> >   cached_target_block = RelationGetTargetBlock(rel);
> >   if (cached_target_block != InvalidBlockNumber &&
> >   cached_target_block < cur_nblocks)
> > - fsm_local_map.map[cached_target_block] = FSM_LOCAL_NOT_AVAIL;
> > + rel->fsm_local_map->map[cached_target_block] = 
> > FSM_LOCAL_NOT_AVAIL;
> >  }
>
> I think there shouldn't be any need for this anymore. After this change
> we shouldn't invalidate the map until there's no space on it - thereby
> addressing the cost overhead, and greatly reducing the likelihood that
> the local FSM can lead to increased bloat.
>

If we invalidate it only when there's no space on the page, then when
should we set it back to available?  If we don't do that, we might miss
space freed by concurrent deletes.

>
> >  /*
> > @@ -1168,18 +1143,18 @@ fsm_local_set(Relation rel, BlockNumber cur_nblocks)
> >   * This function is used when there is no FSM.
> >   */
> >  static BlockNumber
> > -fsm_local_search(void)
> > +fsm_local_search(Relation rel)
> >  {
> >   BlockNumber target_block;
> >
> >   /* Local map must be set by now. */
> > - Assert(FSM_LOCAL_MAP_EXISTS);
> > + Assert(FSM_LOCAL_MAP_EXISTS(rel));
> >
> > - target_block = fsm_local_map.nblocks;
> > + target_block = rel->fsm_local_map->nblocks;
> >   do
> >   {
> >   target_block--;
> > - if (fsm_local_map.map[target_block] == FSM_LOCAL_AVAIL)
> > + if (rel->fsm_local_map->map[target_block] == FSM_LOCAL_AVAIL)
> >   return target_block;
> >   } while (target_block > 0);
> >
> > @@ -1189,7 +1164,22 @@ fsm_local_search(void)
> >* first, which would otherwise lead to the same conclusion again and
> >* again.
> >*/
> > - FSMClearLocalMap();
> > + fsm_clear_local_map(rel);
>
> I'm not sure I like this. My inclination would be that we should be able
> to check the local fsm repeatedly even if there's no space in the
> in-memory representation - otherwise the use of the local FSM increases
> bloat.
>

Do you mean to say that we always check all the pages (say 4)
irrespective of their state in the local map?

I think we should first work out, in this new scheme, (a) when to set
the map, (b) when to clear it, and (c) how to use it.  I have tried to
summarize my thoughts about these below; let me know what you think.

When to set the map?
At the beginning (the first time the relation is used in the backend),
the map will be clear.  The first time in the backend that we find the
FSM doesn't exist and the number of blocks is less than
HEAP_FSM_CREATION_THRESHOLD, we set the map for the total blocks that
exist at that time and mark all or alternate blocks as available.

Also, when we find that none of the blocks are available in the map
(basically they are marked invalid, which means we have previously
checked that there is no space in them), we should get the number of
blocks and, if it is less than the threshold, add them to the map.


When to clear the map?
Once we find that a particular page doesn't have space, we can mark the
corresponding page in the map as invalid (i.e. not available to check).
After relation extension, we can check whether the latest block number
exceeds the threshold value and, if so, clear the map.  At truncation or
other similar times, when the relcache entry is invalidated, the map will
be cleared automatically.

How to use the map?
Now, whenever we find the map exists, we can check the blocks that are
marked as available in it.
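
To make the proposed life cycle concrete, here is a minimal sketch in plain
C (illustrative types and names only, deliberately not tied to the real
Relation/fsm_local_map code):

#include <stdbool.h>
#include <stdint.h>

#define HEAP_FSM_CREATION_THRESHOLD 4   /* same idea as the real threshold */

typedef struct LocalMap
{
    uint32_t    nblocks;                            /* 0 means "not set" */
    bool        avail[HEAP_FSM_CREATION_THRESHOLD];
} LocalMap;

/* "When to set": the relation is small and has no FSM yet. */
static void
local_map_set(LocalMap *map, uint32_t rel_nblocks)
{
    if (rel_nblocks == 0 || rel_nblocks > HEAP_FSM_CREATION_THRESHOLD)
        return;                 /* big enough for a real FSM, don't bother */
    map->nblocks = rel_nblocks;
    for (uint32_t i = 0; i < rel_nblocks; i++)
        map->avail[i] = true;   /* or mark only alternate blocks */
}

/* Part of "when to clear": a page turned out to have no usable space. */
static void
local_map_mark_full(LocalMap *map, uint32_t blkno)
{
    if (blkno < map->nblocks)
        map->avail[blkno] = false;
}

/* "How to use": pick any block still marked available, else give up and
 * let the caller extend the relation (or rebuild the map). */
static bool
local_map_search(const LocalMap *map, uint32_t *blkno)
{
    for (uint32_t i = map->nblocks; i > 0; i--)
    {
        if (map->avail[i - 1])
        {
            *blkno = i - 1;
            return true;
        }
    }
    return false;
}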

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Regression test PANICs with master-standby setup on same machine

2019-04-23 Thread Kyotaro HORIGUCHI
At Tue, 23 Apr 2019 17:44:18 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190423.174418.262292011.horiguchi.kyot...@lab.ntt.co.jp>
> At Tue, 23 Apr 2019 16:08:18 +0900, Michael Paquier  
> wrote in <20190423070818.gm2...@paquier.xyz>
> > On Mon, Apr 22, 2019 at 11:00:03PM -0700, Andres Freund wrote:
> > > FWIW, I think the right fix for this is to simply drop the requirement
> > > that tablespace paths need to be absolute. It's not buying us anything,
> > > it's just making things more complicated. We should just do a simple
> > > check against the tablespace being inside PGDATA, and leave it at
> > > that. Yes, that can be tricked, but so can the current system.
> > 
> > convert_and_check_filename() checks after that already, mostly.  For
> > TAP tests I am not sure that this would help much though as all the
> > nodes of a given test use the same root path for their data folders,
> > so you cannot just use "../hoge/" as location.  We already generate a
> > warning when a tablespace is in a data folder, as this causes issues
> > with recursion lookups of base backups.  What do you mean in this
> > case?  Forbidding the behavior? 
> 
> Isn't it good enough just to warn when we see pg_tblspc twice
> while scanning? The check is not perfect for an "absolute path"
> that contains '/./' above the pgdata directory.

I don't see how basebackup would recurse. How can I make that happen?
basebackup rejects a non-empty directory as a tablespace directory. I'm
not sure about pg_upgrade, but it's not a problem as long as we keep
warning on that kind of tablespace directory.

So I propose this:

 - Allow a relative path as a tablespace directory, in exchange for
   issuing a WARNING.

  =# CREATE TABLESPACE ts1 LOCATION '../../hoge';
  "WARNING:  tablespace location should be in absolute path"

 - For absolute paths, we keep warning as before. Of course we
   don't bother with '.' and '..'.

  =# CREATE TABLESPACE ts1 LOCATION '/home/.../data';
  "WARNING:  tablespace location should not be in the data directory"


> For TAP tests, we can point to the generated temporary directory with
> "../../".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Trouble with FETCH_COUNT and combined queries in psql

2019-04-23 Thread Daniel Verite
Fabien COELHO wrote:

> I added some stuff to extract embedded "\;" for pgbench "\cset", which has 
> been removed though, but it is easy to add back a detection of "\;", and 
> also to detect select. If the position of the last select is known, the 
> cursor can be declared in the right place, which would also solve the 
> problem.

Thanks, I'll extract the necessary bits from your patch.
I don't plan to go as far as injecting a DECLARE CURSOR inside
the query, but rather just forbid the use of the cursor in
the combined-queries case.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: Regression test PANICs with master-standby setup on same machine

2019-04-23 Thread Kyotaro HORIGUCHI
At Tue, 23 Apr 2019 16:08:18 +0900, Michael Paquier  wrote 
in <20190423070818.gm2...@paquier.xyz>
> On Mon, Apr 22, 2019 at 11:00:03PM -0700, Andres Freund wrote:
> > FWIW, I think the right fix for this is to simply drop the requirement
> > that tablespace paths need to be absolute. It's not buying us anything,
> > it's just making things more complicated. We should just do a simple
> > check against the tablespace being inside PGDATA, and leave it at
> > that. Yes, that can be tricked, but so can the current system.
> 
> convert_and_check_filename() checks after that already, mostly.  For
> TAP tests I am not sure that this would help much though as all the
> nodes of a given test use the same root path for their data folders,
> so you cannot just use "../hoge/" as location.  We already generate a
> warning when a tablespace is in a data folder, as this causes issues
> with recursion lookups of base backups.  What do you mean in this
> case?  Forbidding the behavior? 

Isn't it good enough just to warn when we see pg_tblspc twice
while scanning? The check is not perfect for an "absolute path"
that contains '/./' above the pgdata directory.

For TAP tests, we can point to the generated temporary directory with
"../../".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Regression test PANICs with master-standby setup on same machine

2019-04-23 Thread Kyotaro HORIGUCHI
At Tue, 23 Apr 2019 14:53:28 +0900, Michael Paquier  wrote 
in <20190423055328.gk2...@paquier.xyz>
> On Tue, Apr 23, 2019 at 01:33:39PM +0900, Kyotaro HORIGUCHI wrote:
> > I think this is rather a testing or debugging feature. This can
> > be applied to all paths, so the variable might be "path_prefix" or
> > something more generic than tablespace_chroot.
> > 
> > Does it make sense?
> 
> A GUC which enforces object creation does not sound like a good idea
> to me, and what you propose would still bite back, for example two
> local nodes could use the same port, but a different Unix socket
> path.

It doesn't necessarily have to be the port number, but I agree that it's
not a good idea. I prefer allowing relative paths for tablespaces.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Why does ExecComputeStoredGenerated() form a heap tuple

2019-04-23 Thread Peter Eisentraut
On 2019-04-01 11:23, Peter Eisentraut wrote:
> On 2019-03-31 04:57, Andres Freund wrote:
>> while rebasing the remaining tableam patches (luckily a pretty small set
>> now!), I had a few conflicts with ExecComputeStoredGenerated(). While
>> resolving I noticed:
> 
> The core of that code was written a long time ago and perhaps hasn't
> caught up with all the refactoring going on.  I'll look through your
> proposal and update the code.

The attached patch is based on your sketch.  It's clearly better in the
long term not to rely on heap tuples here.  But in testing this change
seems to make it slightly slower, certainly not a speedup as you were
apparently hoping for.


Test setup:

create table t0 (a int, b int);
insert into t0 select generate_series(1, 10000000);  -- 10 million
\copy t0 (a) to 'test.dat';

-- for comparison, without generated column
truncate t0;
\copy t0 (a) from 'test.dat';

-- master
create table t1 (a int, b int generated always as (a * 2) stored);
truncate t1;
\copy t1 (a) from 'test.dat';

-- patched
truncate t1;
\copy t1 (a) from 'test.dat';

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From d87b94d555c1d863a65e738a6953a1c87b0af2b2 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 23 Apr 2019 10:18:39 +0200
Subject: [PATCH] Don't form heap tuple in ExecComputeStoredGenerated

---
 src/backend/executor/nodeModifyTable.c | 27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 444c0c0574..216c3baa64 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -53,6 +53,7 @@
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -254,9 +255,6 @@ ExecComputeStoredGenerated(EState *estate, TupleTableSlot *slot)
MemoryContext oldContext;
Datum  *values;
bool   *nulls;
-   bool   *replaces;
-   HeapTuple   oldtuple, newtuple;
-   bool        should_free;
 
Assert(tupdesc->constr && tupdesc->constr->has_generated_stored);
 
@@ -294,7 +292,10 @@ ExecComputeStoredGenerated(EState *estate, TupleTableSlot *slot)
 
values = palloc(sizeof(*values) * natts);
nulls = palloc(sizeof(*nulls) * natts);
-   replaces = palloc0(sizeof(*replaces) * natts);
+
+   slot_getallattrs(slot);
+   memcpy(values, slot->tts_values, sizeof(*values) * natts);
+   memcpy(nulls, slot->tts_isnull, sizeof(*nulls) * natts);
 
for (int i = 0; i < natts; i++)
{
@@ -311,20 +312,16 @@ ExecComputeStoredGenerated(EState *estate, TupleTableSlot *slot)
 
values[i] = val;
nulls[i] = isnull;
-   replaces[i] = true;
}
+   else
+   values[i] = datumCopy(values[i], TupleDescAttr(tupdesc, i)->attbyval, TupleDescAttr(tupdesc, i)->attlen);
}
 
-   oldtuple = ExecFetchSlotHeapTuple(slot, true, &should_free);
-   newtuple = heap_modify_tuple(oldtuple, tupdesc, values, nulls, replaces);
-   /*
-* The tuple will be freed by way of the memory context - the slot might
-* only be cleared after the context is reset, and we'd thus potentially
-* double free.
-*/
-   ExecForceStoreHeapTuple(newtuple, slot, false);
-   if (should_free)
-   heap_freetuple(oldtuple);
+   ExecClearTuple(slot);
+   memcpy(slot->tts_values, values, sizeof(*values) * natts);
+   memcpy(slot->tts_isnull, nulls, sizeof(*nulls) * natts);
+   ExecStoreVirtualTuple(slot);
+   ExecMaterializeSlot(slot);
 
MemoryContextSwitchTo(oldContext);
 }
-- 
2.21.0



RE: libpq debug log

2019-04-23 Thread Jamison, Kirk
Hi Aya-san,

I tested your v3 patch and it seemed to work on my Linux environment.
However, the CF Patch Tester detected a build failure (probably on Windows).
Refer to: http://commitfest.cputube.org/

Docs:
It would be better to have a reference to the documentation of the
Frontend/Backend Protocol's "Message Format".

Code:
There are some protocol message types from the frontend
that you missed indicating (non-BYTE1 types):
CancelRequest (F), StartupMessage (F), SSLRequest (F).

Although I haven't tested those actual protocol messages,
I assume they will be printed as follows, since the length and
message will still be recognized.
ex. Timestamp 8 80877103

So you need to indicate these protocol message types as intended.
ex. Timestamp > SSLRequest 8 80877103
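
Since these three packets carry no 1-byte type code, a logger has to
recognize them by the 32-bit code that follows the Int32 length.  A rough
sketch (the constants are the wire values from the protocol documentation;
the function names are made up and not part of the actual patch):

#include <stdint.h>
#include <stdio.h>

/* Wire values from the Frontend/Backend Protocol documentation. */
#define CANCEL_REQUEST_CODE  80877102   /* 1234 << 16 | 5678 */
#define NEGOTIATE_SSL_CODE   80877103   /* 1234 << 16 | 5679 */
#define PG_PROTOCOL_3_0      196608     /* 3 << 16 */

/* Identify a startup-phase packet by the Int32 code after the length. */
static const char *
startup_packet_name(uint32_t code)
{
    switch (code)
    {
        case CANCEL_REQUEST_CODE:
            return "CancelRequest";
        case NEGOTIATE_SSL_CODE:
            return "SSLRequest";
        case PG_PROTOCOL_3_0:
            return "StartupMessage";
        default:
            return "UnknownStartupPacket";
    }
}

static void
log_startup_packet(FILE *out, const char *timestamp,
                   uint32_t length, uint32_t code)
{
    /* e.g. "2019-04-23 ... > SSLRequest 8 80877103" */
    fprintf(out, "%s > %s %u %u\n",
            timestamp, startup_packet_name(code), length, code);
}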


Regards,
Kirk Jamison





Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2019-04-23 Thread Kyotaro HORIGUCHI
Hello.

At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo  wrote in 

> Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
> create redo error, but I suspect some other kind of redo, which depends on
> the files under the directory (they are not copied since the directory is
> not created) and also cannot be covered by the invalid page mechanism,
> could fail. Thanks.

If recovery starts from just after tablespace creation, that's
simple. The symlink to the removed tablespace is already removed
in that case. Hence the server innocently creates files directly
under pg_tblspc, not in the tablespace. Finally, all files that were
supposed to be created in the removed tablespace are removed
later in recovery.

If recovery starts by recalling a page in a file that had been
in the tablespace, XLogReadBufferExtended creates one (perhaps
directly in pg_tblspc as described above) and the files are
removed later in recovery in the same way as above. This case doesn't
cause FATAL/PANIC during recovery even in master.

XLogReadBufferExtended@xlogutils.c
| * Create the target file if it doesn't already exist.  This lets us cope
| * if the replay sequence contains writes to a relation that is later
| * deleted.  (The original coding of this routine would instead suppress
| * the writes, but that seems like it risks losing valuable data if the
| * filesystem loses an inode during a crash.  Better to write the data
| * until we are actually told to delete the file.)

So buffered access cannot be a problem for the reason above. The
remaining possible issue is non-buffered access to files in
removed tablespaces. This is what I mentioned upthread:

me> but I haven't checked this covers all places where need the same
me> treatment.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: pgsql: Allow insert and update tuple routing and COPY for foreign table

2019-04-23 Thread Amit Langote
On 2019/04/22 21:45, Etsuro Fujita wrote:
> (2019/04/20 20:53), Laurenz Albe wrote:
>> I propose that PostgreSQL only allows COPY FROM on a foreign table if
>> the FDW
>> implements BeginForeignInsert.  The attached patch implements that.
> 
> I don't think that is a good idea, because there might be some FDWs that
> want to support COPY FROM on foreign tables without providing
> BeginForeignInsert.  (As for INSERT into foreign tables, we actually allow
> FDWs to support it without providing PlanForeignModify,
> BeginForeignModify, or EndForeignModify.)

I now understand why Laurenz's patch would in fact be a regression for
FDWs that do support COPY FROM and partition tuple routing without
providing BeginForeignInsert. My first reaction was the opposite, but
that was based on thinking (without confirming) that it's the core that
would crash due to the initialization step being absent, which is not the case.

The documentation [1] also says:

  If the BeginForeignInsert pointer is set to NULL, no action is taken for
  the initialization.

[1]
https://www.postgresql.org/docs/current/fdw-callbacks.html#FDW-CALLBACKS-UPDATE

> It's permissible to throw an error in BeginForeignInsert, so what I was
> thinking for FDWs that don't want to support COPY FROM and
> INSERT/UPDATE/COPY FROM tuple routing was to provide BeginForeignInsert
> implementing something like this:
> 
> static void
> fooBeginForeignInsert(ModifyTableState *mtstate,
>                       ResultRelInfo *resultRelInfo)
> {
>     Relation    rel = resultRelInfo->ri_RelationDesc;
>
>     if (mtstate->ps.plan == NULL)
>         ereport(ERROR,
>                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>                  errmsg("cannot copy to foreign table \"%s\"",
>                         RelationGetRelationName(rel))));
>     else
>         ereport(ERROR,
>                 (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>                  errmsg("cannot route tuples into foreign table \"%s\"",
>                         RelationGetRelationName(rel))));
> }

+1

Thanks,
Amit





Re: Symbol referencing errors

2019-04-23 Thread Laurenz Albe
On Tue, 2019-04-23 at 04:26 +, Li Japin wrote:
> Yes, those errors does not impact the postgresql, but when
> I use oracle_fdw extension, I couldn't startup the postgresql,
> and I find that the dlopen throw an error which lead postmaster
> exit, and there is not more information.

That may well be a bug in oracle_fdw, since I have no reports of
anybody running it on that operating system.

Maybe you should open an oracle_fdw issue, but I don't know how
much I can help you, since this is the first time I have heard
of SmartOS.

Yours,
Laurenz Albe





Re: Regression test PANICs with master-standby setup on same machine

2019-04-23 Thread Michael Paquier
On Mon, Apr 22, 2019 at 11:00:03PM -0700, Andres Freund wrote:
> FWIW, I think the right fix for this is to simply drop the requirement
> that tablespace paths need to be absolute. It's not buying us anything,
> it's just making things more complicated. We should just do a simple
> check against the tablespace being inside PGDATA, and leave it at
> that. Yes, that can be tricked, but so can the current system.

convert_and_check_filename() checks after that already, mostly.  For
TAP tests I am not sure that this would help much though as all the
nodes of a given test use the same root path for their data folders,
so you cannot just use "../hoge/" as location.  We already generate a
warning when a tablespace is in a data folder, as this causes issues
with recursion lookups of base backups.  What do you mean in this
case?  Forbidding the behavior? 
--
Michael


signature.asc
Description: PGP signature


Re: pgsql: Allow insert and update tuple routing and COPY for foreign table

2019-04-23 Thread Laurenz Albe
On Mon, 2019-04-22 at 14:07 -0700, Andres Freund wrote:
> How about just applying the patch from v12 on?
> > Then it is a behavior change in a major release, which is acceptable.
> > It requires the imaginary FDW above to add an empty BeginForeignInsert
> > callback function, but will unbreak FDW that slept through the change
> > that added COPY support.
> 
> I fail to see the advantage. It'll still require FDWs to be fixed to
> work correctly for v11, but additionally adds another set of API
> differences that needs to be fixed by another set of FDWs.  I think this
> ship simply has sailed.

I can accept that (having fixed my own FDW), but I am worried that it will
cause problems for FDW users.  Well, I guess they can always avoid COPY if
they don't want FDWs to crash.

Yours,
Laurenz Albe





Re: pg_dump is broken for partition tablespaces

2019-04-23 Thread Amit Langote
On 2019/04/23 14:45, David Rowley wrote:
> On Tue, 23 Apr 2019 at 13:49, Amit Langote
>  wrote:
>>
>> On 2019/04/23 7:51, Alvaro Herrera wrote:
>>> To me, it sounds
>>> unintuitive to accept partitions that don't exactly match the order of
>>> the parent table; but it's been supported all along.
>>
>> You might know it already, but even though column sets of two tables may
>> appear identical, their TupleDescs still may not match due to dropped
>> columns being different in the two tables.
> 
> I think that's the most likely reason that the TupleDescs would differ
> at all.  For RANGE partitions on time series data, it's quite likely
> that new partitions are periodically created to store new data.  If
> the partitioned table those belong to evolved over time, gaining new
> columns and dropping columns that are no longer needed then some
> translation work will end up being required.  From my work on
> 42f70cd9c, I know tuple conversion is not free, so it's pretty good
> that pg_dump will remove the need for maps in this case even with the
> proposed change.

Maybe I'm missing something, but if you're talking about pg_dump changes
proposed in the latest patch that Alvaro posted on April 18, which is to
emit partitions as two steps, then I don't see how that will always
improves things in terms of whether maps are needed or not (regardless of
whether that's something to optimize for or not.)  If partitions needed a
map in the old database, this patch means that they will *continue* to
need it in the new database.  With HEAD, they won't, because partitions
created with CREATE TABLE PARTITION OF will have the same descriptor as
parent, provided the parent is also created afresh in the new database,
which is true in the non-binary-upgrade mode.  The current arrangement, as
I mentioned in my previous email, is partly inspired from the fact that
creating the parent and partition afresh in the new database will lead
them to have the same TupleDesc and hence won't need maps.

Thanks,
Amit





Re: Regression test PANICs with master-standby setup on same machine

2019-04-23 Thread Andres Freund
Hi,

On 2019-04-22 15:52:59 +0530, Kuntal Ghosh wrote:
> Hello hackers,
> 
> With a master-standby setup configured on the same machine, I'm
> getting a panic in tablespace test while running make installcheck.
> 
> 1. CREATE TABLESPACE regress_tblspacewith LOCATION 'blah';
> 2. DROP TABLESPACE regress_tblspacewith;
> 3. CREATE TABLESPACE regress_tblspace LOCATION 'blah';
> -- do some operations in this tablespace
> 4. DROP TABLESPACE regress_tblspace;
> 
> The master panics at the last statement when standby has completed
> applying the WAL up to step 2 but hasn't started step 3.
> PANIC:  could not fsync file
> "pg_tblspc/16387/PG_12_201904072/16384/16446": No such file or
> directory
> 
> The reason is both the tablespace points to the same location. When
> master tries to delete the new tablespace (and requests a checkpoint),
> the corresponding folder is already deleted by the standby while
> applying WAL to delete the old tablespace. I'm able to reproduce the
> issue with the attached script.
> 
> sh standby-server-setup.sh
> make installcheck
> 
> I accept that configuring master-standby on the same machine for this
> test is not okay. But, can we avoid the PANIC somehow? Or, is this
> intentional and I should not include testtablespace in this case?

FWIW, I think the right fix for this is to simply drop the requirement
that tablespace paths need to be absolute. It's not buying us anything,
it's just making things more complicated. We should just do a simple
check against the tablespace being inside PGDATA, and leave it at
that. Yes, that can be tricked, but so can the current system.

That'd make both regression tests easier, as well as operations.

Greetings,

Andres Freund