Re: Minimal logical decoding on standbys

2019-03-01 Thread tushar

Hi,

While testing this feature, I found that if lots of inserts happen on 
the master cluster, pg_recvlogical does not show the DATA 
information for a logical replication slot created on the SLAVE.


Please refer to this scenario -

1)
Create a Master cluster with wal_level=logical and create a logical 
replication slot -
 SELECT * FROM pg_create_logical_replication_slot('master_slot', 
'test_decoding');


2)
Create a Standby cluster using pg_basebackup (./pg_basebackup -D 
slave/ -v -R) and create a logical replication slot -
SELECT * FROM pg_create_logical_replication_slot('standby_slot', 
'test_decoding');


3)
X terminal - start  pg_recvlogical  , provide port= ( slave 
cluster)  and specify slot=standby_slot
./pg_recvlogical -d postgres  -p  -s 1 -F 1  -v --slot=standby_slot  
--start -f -


Y terminal - start  pg_recvlogical  , provide port=5432 ( master 
cluster)  and specify slot=master_slot
./pg_recvlogical -d postgres  -p 5432 -s 1 -F 1  -v --slot=master_slot  
--start -f -


Z terminal - run pg_bench  against Master cluster ( ./pg_bench -i -s 10 
postgres)


DATA information is visible on the Y terminal but not on X.

However, the same data can be seen by running the query below on the SLAVE cluster -

SELECT * FROM pg_logical_slot_get_changes('standby_slot', NULL, NULL);

Is this expected?

regards,
tushar

On 12/17/2018 10:46 PM, Petr Jelinek wrote:

Hi,

On 12/12/2018 21:41, Andres Freund wrote:

I don't like the approach of managing the catalog horizon via those
periodically logged catalog xmin announcements.  I think we instead
should build ontop of the records we already have and use to compute
snapshot conflicts.  As of HEAD we don't know whether such tables are
catalog tables, but that's just a bool that we need to include in the
records, a basically immeasurable overhead given the size of those
records.

IIRC I was originally advocating adding that xmin announcement to the
standby snapshot message, but this seems better.


If we were to go with this approach, there'd be at least the following
tasks:
- adapt tests from [2]
- enforce hot-standby to be enabled on the standby when logical slots
   are created, and at startup if a logical slot exists
- fix issue around btree_xlog_delete_get_latestRemovedXid etc mentioned
   above.
- Have a nicer conflict handling than what I implemented here.  Craig's
   approach deleted the slots, but I'm not sure I like that.  Blocking
   seems more appropriate here; after all, it's likely that the
   replication topology would be broken afterwards.
- get_rel_logical_catalog() shouldn't be in lsyscache.[ch], and can be
   optimized (e.g. check wal_level before opening rel etc).


Once we have this logic, it can be used to implement something like
failover slots on top, by having a mechanism that occasionally
forwards slots on standbys using pg_replication_slot_advance().


Looking at this from the failover slots perspective. Wouldn't blocking
on conflict mean that we stop physical replication on catalog xmin
advance when there is lagging logical replication on the primary? It might
not be too big a deal, as in that use case it should only happen if
hs_feedback was off at some point, but just wanted to point out this
potential problem.



--
regards, tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: Drop type "smgr"?

2019-03-01 Thread Konstantin Knizhnik




On 01.03.2019 1:32, Thomas Munro wrote:

On Fri, Mar 1, 2019 at 10:41 AM Shawn Debnath  wrote:

On Fri, Mar 01, 2019 at 10:33:06AM +1300, Thomas Munro wrote:

It doesn't make any sense to put things like clog or any other SLRU in
a non-default tablespace though.  It's perfectly OK if not all smgr
implementations know how to deal with tablespaces, and the SLRU
support should just not support that.

If the generic storage manager, or whatever we end up calling it, ends
up being generic enough - it's possible that the tablespace value would have
to be respected.

Right, you and I have discussed this a bit off-list, but for the
benefit of others, I think what you're getting at with "generic
storage manager" here is something like this: on the one hand, our
proposed revival of SMGR as a configuration point is about
supporting alternative file layouts for bufmgr data, but at the same
time there is some background noise about direct IO, block encryption,
... and who knows what alternative block storage someone might come up
with ... at the block level.  So although it sounds a bit
contradictory to be saying "let's make all these different SMGRs!" at
the same time as saying "but we'll eventually need a single generic
SMGR that is smart enough to be parameterised for all of these
layouts!", I see why you say it.  In fact, the prime motivation for
putting SLRUs into shared buffers is to get better buffering, because
(anecdotally) slru.c's mini-buffer scheme performs abysmally without
the benefit of an OS page cache.  If we add optional direct IO support
(something I really want), we need it to apply to SLRUs, undo and
relations, ideally without duplicating code, so we'd probably want to
chop things up differently.  At some point I think we'll need to
separate the questions "how to map blocks to filenames and offsets"
and "how to actually perform IO".  I think the first question would be
controlled by the SMGR IDs as discussed, but the second question
probably needs to be controlled by GUCs that control all IO, and/or
special per relation settings (supposing you can encrypt just one
table, as a random example I know nothing about); but that seems way
out of scope for the present projects.  IMHO the best path from here
is to leave md.c totally untouched for now as the SMGR for plain old
relations, while we work on getting these new kinds of bufmgr data
into the tree as a first step, and a later hypothetical direct IO or
whatever project can pay for the refactoring to separate IO from
layout.



I completely agree with this statement:

At some point I think we'll need to separate the questions "how to map blocks to filenames and 
offsets" and "how to actually perform IO".


There are two subsystems developed in PgPro which are integrated into 
Postgres at the file IO level: CFS (compressed file system) and SnapFS (fast 
database snapshots).
The first provides page-level encryption and compression, the second a 
mechanism for fast restoration of database state.
Both are implemented by patching fd.c. My first idea was to implement 
them as alternative storage devices (alternative to md.c), but that would 
require duplicating all of the segment mapping logic from md.c plus the file 
descriptor cache from fd.c. It would be nice if it were possible to 
redefine the raw file operations (FileWrite, FileRead, ...) without affecting 
the segment mapping logic.
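To make the idea concrete (purely a sketch - PostgreSQL has no such hook today,
and the struct and function names below are hypothetical), the redefinition
could amount to a small table of raw file operations that fd.c consults,
leaving md.c's segment mapping untouched:

#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical sketch only: fd.c would route its lowest-level reads and
 * writes through this table, so an extension (e.g. CFS for compression or
 * encryption) could replace them without touching md.c's segment mapping.
 */
typedef struct RawFileOps
{
	ssize_t		(*raw_read) (int fd, void *buffer, size_t amount, off_t offset);
	ssize_t		(*raw_write) (int fd, const void *buffer, size_t amount, off_t offset);
} RawFileOps;

/* Default implementation: plain pread()/pwrite() on the descriptor. */
static ssize_t
default_raw_read(int fd, void *buffer, size_t amount, off_t offset)
{
	return pread(fd, buffer, amount, offset);
}

static ssize_t
default_raw_write(int fd, const void *buffer, size_t amount, off_t offset)
{
	return pwrite(fd, buffer, amount, offset);
}

/* A CFS- or SnapFS-style extension would install its own callbacks here. */
static RawFileOps raw_file_ops = {default_raw_read, default_raw_write};

An extension would then swap in compressing, encrypting, or snapshot-aware
read/write callbacks instead of the defaults.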


One more thing... From my point of view, one of the drawbacks of Postgres 
is that it requires an underlying file system and is not able to work with 
raw partitions.
It seems to me that bypassing the file system layer could significantly improve 
performance and give more possibilities for IO performance tuning.
Certainly it would require a lot of changes in the Postgres storage layer, so 
this is not what I suggest implementing or even discussing right now.
But it may be useful to keep it in mind in discussions concerning 
"generic storage manager".



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Protect syscache from bloating with negative cache entries

2019-03-01 Thread Kyotaro HORIGUCHI
At Tue, 26 Feb 2019 10:55:18 -0500, Robert Haas  wrote 
in 
> On Mon, Feb 25, 2019 at 1:27 AM Kyotaro HORIGUCHI
>  wrote:
> > > I'd like to see some evidence that catalog_cache_memory_target has any
> > > value, vs. just always setting it to zero.
> >
> > It is artificial (or actually won't be repeatedly executed in a
> > session) but anyway what can get benefit from
> > catalog_cache_memory_target would be a kind of extreme.
> 
> I agree.  So then let's not have it.

Ah... Yeah! I see. Andres' concern was that crucial syscache
entries might be blown away during a long idle time. If that
happens, it's enough to just turn the feature off in almost all of
such cases.

We no longer need to count memory usage without the feature. That
stuff has been moved to the monitoring feature, which is out of the scope
of the current state of this patch.

> We shouldn't add more mechanism here than actually has value.  It
> seems pretty clear that keeping cache entries that go unused for long
> periods can't be that important; even if we need them again
> eventually, reloading them every 5 or 10 minutes can't hurt that much.
> On the other hand, I think it's also pretty clear that evicting cache
> entries that are being used frequently will have disastrous effects on
> performance; as I noted in the other email I just sent, consider the
> effects of CLOBBER_CACHE_ALWAYS.  No reasonable user is going to want
> to incur a massive slowdown to save a little bit of memory.
> 
> I see that *in theory* there is a value to
> catalog_cache_memory_target, because *maybe* there is a workload where
> tuning that GUC will lead to better performance at lower memory usage
> than any competing proposal.  But unless we can actually see an
> example of such a workload, which so far I don't, we're adding a knob
> that everybody has to think about how to tune when in fact we have no
> idea how to tune it or whether it even needs to be tuned.  That
> doesn't make sense.  We have to be able to document the parameters we
> have and explain to users how they should be used.  And as far as this
> parameter is concerned I think we are not at that point.

In the attached v18,
   catalog_cache_memory_target is removed,
   some leftovers from removing the hard-limit feature are removed,
   the catcache clock update during a query is separated into 0003, and
   0004 (the monitoring part) is attached just to see how it is working.

v18-0001-Add-dlist_move_tail:
  Just adds dlist_move_tail

v18-0002-Remove-entries-that-haven-t-been-used-for-a-certain-:
  Revised pruning feature.


v18-0003-Asynchronous-update-of-catcache-clock:
  Separated catcache clock update feature.

v18-0004-Syscache-usage-tracking-feature:
  Usage tracking feature.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 54388a7452eda1faadaa108e1bc21d51844f9224 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 7 Feb 2019 14:56:07 +0900
Subject: [PATCH 1/6] Add dlist_move_tail

We have dlist_push_head/tail and dlist_move_head but not
dlist_move_tail. Add it.
---
 src/include/lib/ilist.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/src/include/lib/ilist.h b/src/include/lib/ilist.h
index b1a5974ee4..659ab1ac87 100644
--- a/src/include/lib/ilist.h
+++ b/src/include/lib/ilist.h
@@ -394,6 +394,25 @@ dlist_move_head(dlist_head *head, dlist_node *node)
 	dlist_check(head);
 }
 
+/*
+ * Move element from its current position in the list to the tail position in
+ * the same list.
+ *
+ * Undefined behaviour if 'node' is not already part of the list.
+ */
+static inline void
+dlist_move_tail(dlist_head *head, dlist_node *node)
+{
+	/* fast path if it's already at the tail */
+	if (head->head.prev == node)
+		return;
+
+	dlist_delete(node);
+	dlist_push_tail(head, node);
+
+	dlist_check(head);
+}
+
 /*
  * Check whether 'node' has a following node.
  * Caution: unreliable if 'node' is not in the list.
-- 
2.16.3
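(For context, a minimal sketch of how a caller can use the new
dlist_move_tail() to keep a least-recently-used ordering; the entry struct
and list names below are hypothetical illustrations, not part of the patch.)

#include "lib/ilist.h"

/* Hypothetical cache entry: field names are made up for illustration. */
typedef struct CacheEntry
{
	dlist_node	lru_node;		/* position in the LRU list */
	/* ... cached data would live here ... */
} CacheEntry;

static dlist_head lru_list = DLIST_STATIC_INIT(lru_list);

/*
 * On every access, move the entry to the tail, so the head of the list
 * always holds the least recently used entries and pruning can stop at
 * the first entry that is recent enough.
 */
static void
touch_entry(CacheEntry *entry)
{
	dlist_move_tail(&lru_list, &entry->lru_node);
}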

From c79d5fc86f45e6545cbc257040e46125ffc5cb92 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Fri, 1 Mar 2019 13:32:51 +0900
Subject: [PATCH 2/6] Remove entries that haven't been used for a certain time

Catcache entries can happen to be left unused for several reasons. It is
not desirable that such useless entries eat up memory. The catcache
pruning feature removes entries that haven't been accessed for a
certain time before enlarging the hash array.
---
 doc/src/sgml/config.sgml  |  19 
 src/backend/tcop/postgres.c   |   2 +
 src/backend/utils/cache/catcache.c| 122 +-
 src/backend/utils/misc/guc.c  |  12 +++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/utils/catcache.h  |  18 
 6 files changed, 171 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6d42b7afe7..737a156bb4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1661,6 +1661,25 @@ include_dir 'conf.

RE: SQL statement PREPARE does not work in ECPG

2019-03-01 Thread Matsumura, Ryo
Hi  Meskes-san, Takahashi-san

> If the standard allows it, we want to be able to process it.

I will try to implement it with Idea-2, which doesn't use PQprepare(), and
Takahashi-san's following idea.

> For example, 
> - ECPG convert ":ID" to "$1" and "$1" in the original statement to "$$1"
> - next_insert() do not check "$$1"
> - ECPGdo() reconvert "$$1" to "$1"

But I will probably be late because I don't understand parse.pl very well.
I think that the following rule is generated by parse.pl.

 PreparableStmt:
 SelectStmt
 {
 is_in_preparable_stmt = true;  <--- I want to add it.
 $$ = $1;
}
|  InsertStmt
.

The above variable is used in ecpg.trailer.

ecpg_param: PARAM   {
if(is_in_preparable_stmt)
$$ = mm_strdup(replace_dollar_to_something());
else
 $$ = make_name();
 } ;


I will use @1 instead of $$1 because the replacement is almost the same as the 
existing replacing function in ecpglib.
Is it good?


Regards
Ryo Matsumura


Re: Add exclusive backup deprecation notes to documentation

2019-03-01 Thread David Steele

On 2/28/19 6:08 PM, Laurenz Albe wrote:

David Steele wrote:

This patch attempts to document the limitations of the exclusive mode.


Thanks!


+   
+ The primary issue with the exclusive method is that the
+ backup_label file is written into the data directory
+ when pg_start_backup is called and remains until
+ pg_stop_backup is called.  If
+ PostgreSQL or the host terminates abnormally


There should be a comma at the end of this line.


Fixed.


+ then backup_label will be left in the data directory
+ and PostgreSQL will not start. A log message


You should say "*may* not start", because it will if the WAL segment is still 
there.


You are correct.  It's still pretty likely though so I went with "probably".

I added some extra language to the warning that gets emitted in the log. 
 Users are more likely to see that than the documentation.


I also addressed a comment from another thread by adding pg_basebackup 
as an "e.g." rather than an "or".


Thanks,
--
-David
da...@pgmasters.net
From fa2391c9f843358d4e2264b62297eb10bbffcc5a Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Fri, 1 Mar 2019 11:14:41 +0200
Subject: [PATCH] Add exclusive backup deprecation notes to documentation.

Update the exclusive backup documentation to explain the limitations of the
exclusive backup mode and make it clear that the feature is deprecated.

Update the log message when the backup_label is found but recovery
cannot proceed.
---
 doc/src/sgml/backup.sgml  | 42 +++
 src/backend/access/transam/xlog.c | 10 ++--
 2 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index a73fd4d044..1d09808b0d 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -948,6 +948,48 @@ SELECT * FROM pg_stop_backup(false, true);


 Making an exclusive low level backup
+
+  
+   
+ The exclusive backup method is deprecated and should be avoided in favor
+ of the non-exclusive backup method, e.g.
+ pg_basebackup.
+   
+
+   
+ The primary issue with the exclusive method is that the
+ backup_label file is written into the data directory
+ when pg_start_backup is called and remains until
+ pg_stop_backup is called.  If
+ PostgreSQL or the host terminates abnormally,
+ then backup_label will be left in the data directory
+ and PostgreSQL probably will not start. A log
+ message recommends that backup_label be removed if not
+ restoring from a backup.
+   
+
+   
+ However, if backup_label is present because a restore
+ is actually in progress, then removing it will result in corruption.  For
+ this reason it is not recommended to automate the removal of
+ backup_label.
+   
+
+   
+ Another issue with exclusive backup mode is that it will continue until
+ pg_stop_backup is called, even if the calling process
+ is no longer performing the backup.  The next time
+ pg_start_backup is called it will fail unless
+ pg_stop_backup is called manually first.
+   
+
+   
+ Finally, only one exclusive backup may be run at a time.  However, it is
+ possible to run an exclusive backup at the same time as any number of
+ non-exclusive backups.
+   
+  
+
 
  The process for an exclusive backup is mostly the same as for a
  non-exclusive one, but it differs in a few key steps. This type of backup
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ecd12fc53a..e12a04414b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6405,14 +6405,20 @@ StartupXLOG(void)
			if (!ReadRecord(xlogreader, checkPoint.redo, LOG, false))
				ereport(FATAL,
						(errmsg("could not find redo location referenced by checkpoint record"),
-						 errhint("If you are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
+						 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" and add recovery options to \"%s/postgresql.auto.conf\".\n"
+								 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
+								 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
+								 DataDir, DataDir, DataDir, DataDir)));
		}
	}
	else
	{
		ereport(FATAL,
				(errmsg("could not locate required checkpoint record"),
-

Re: Add exclusive backup deprecation notes to documentation

2019-03-01 Thread Laurenz Albe
David Steele wrote:
> I added some extra language to the warning that gets emitted in the log. 
>   Users are more likely to see that than the documentation.
> 
> I also addressed a comment from another thread by adding pg_basebackup 
> as an "e.g." rather than an "or".

Looks good to me.

Yours,
Laurenz Albe




Re: Remove Deprecated Exclusive Backup Mode

2019-03-01 Thread Martín Marqués
On Thu, Feb 28, 2019 at 23:50, Michael Paquier
(mich...@paquier.xyz) wrote:
>
> On Thu, Feb 28, 2019 at 11:07:47PM -0300, Martín Marqués wrote:
> > On 28/2/19 at 15:13, David Steele wrote:
> > +   
> > + The exclusive backup method is deprecated and should be avoided in
> > favor
> > + of the non-exclusive backup method or
> > + pg_basebackup.
> > +   
> >
> > Isn't pg_basebackup a non-exclusive backup method?
>
> It seems to me that it is exactly what this sentence means.  Perhaps
> it should be split into two sentences for better clarity?

When I read that phrase I see 2 options: non-exclusive backups, and
pg_basebackup. It's as if pg_basebackup wasn't included in the former
group. It might bring some confusion.

-- 
Martín Marqués    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Remove Deprecated Exclusive Backup Mode

2019-03-01 Thread David Steele

On 3/1/19 11:36 AM, Martín Marqués wrote:

On Thu, Feb 28, 2019 at 23:50, Michael Paquier
(mich...@paquier.xyz) wrote:


On Thu, Feb 28, 2019 at 11:07:47PM -0300, Martín Marqués wrote:

On 28/2/19 at 15:13, David Steele wrote:
+   
+ The exclusive backup method is deprecated and should be avoided in
favor
+ of the non-exclusive backup method or
+ pg_basebackup.
+   

Isn't pg_basebackup a non-exclusive backup method?


It seems to me that it is exactly what this sentence means.  Perhaps
it should be split into two sentences for better clarity?


When I read that phrase I see 2 options: non-exclusive backups, and
pg_basebackup. It's as if pg_basebackup wasn't included in the former
group. It might bring some confusion.


This has been updated on the other thread and I CC'd you.

--
-David
da...@pgmasters.net



Re: Add exclusive backup deprecation notes to documentation

2019-03-01 Thread Martín Marqués
On Fri, Mar 1, 2019 at 06:21, David Steele
(da...@pgmasters.net) wrote:
>
>
> I also addressed a comment from another thread by adding pg_basebackup
> as an "e.g." rather than an "or".

Thanks David,

This looks very good!

-- 
Martín Marqués    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: SQL statement PREPARE does not work in ECPG

2019-03-01 Thread Michael Meskes
Hi Matsumura-san,

> But I will probably be late because I don't understand parse.pl very
> well.
> I think that the following rule is made by parse.pl.
> 
>PreparableStmt:
>SelectStmt
>{
>is_in_preparable_stmt = true;  <--- I want to add it.
>$$ = $1;
>   }
>   |  InsertStmt
>   .

The only way to add this is by creating a replacement production for
this rule. parse.pl cannot do it itself. 

> I will use @1 instead of $$1 because the replacement is almost the same as
> the existing replacing function in ecpglib.
> Is it good?

I'd say so.

Thanks.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL




Re: Refactoring the checkpointer's fsync request queue

2019-03-01 Thread Thomas Munro
On Thu, Feb 28, 2019 at 10:43 AM Thomas Munro  wrote:
> On Thu, Feb 28, 2019 at 10:27 AM Shawn Debnath  wrote:
> > We had a quick offline discussion to get on the same page and we agreed
> > to move forward with Andres' approach above. Attached is patch v10.
> > Here's the overview of the patch:
>
> Thanks.  I will review, and try to rebase my undo patches on top of
> this and see what problems I crash into.

Some initial feedback:

@@ -8616,7 +8617,7 @@ CreateCheckPoint(int flags)
 * the REDO pointer.  Note that smgr must not do anything that'd have to
 * be undone if we decide no checkpoint is needed.
 */
-   smgrpreckpt();
+   PreCheckpoint();

I would call this and the "post" variant something like
SyncPreCheckpoint().  Otherwise it's too general sounding and not
clear which module it's in.

+static const int NSync = lengthof(syncsw);

Unused.

+   fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);

Needs to be closed.
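Something like this, I assume (sketch only; it presumes the descriptor is
only needed briefly, e.g. to fsync the segment, with 'path' coming from the
sync_filepath callback):

	File	fd;

	fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
	if (fd < 0)
		elog(ERROR, "could not open file \"%s\": %m", path);
	if (FileSync(fd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
		elog(ERROR, "could not fsync file \"%s\": %m", path);
	FileClose(fd);				/* release the vfd so it isn't leaked */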

+   path = syncsw[entry->owner].sync_filepath(entry->ftag);

Probably doesn't make much difference, but wouldn't it be better for
the path to be written into a caller-supplied buffer of size
MAXPGPATH?  Then we could have that on the stack instead of alloc/free
for every path.

Hmm, mdfilepath() needs to use GetRelationPath(), and that already
returns palloc'd memory.  Oh well.

+   entry->canceled = true;

Now that we killed the bitmapset, I wonder if we still need this
canceled flag.  What if we just removed the entry from the hash table?
If you killed the canceled flag you could then replace this:

+   if (entry->canceled)
+   break;

.. with another hash table probe to see if the entry went in the
AbsorbSyncRequests() call (having first copied the key into a local
variable since of course "entry" may have been freed).  Or maybe you
don't think that's better, I dunno, just an idea :-)

+ForwardSyncRequest(FileTag ftag, SyncRequestType type, SyncRequestOwner owner)

Is it a deliberate choice that you pass FileTag objects around by
value?  Rather than, say, pointer to const.  Not really a complaint in
the current coding since it's a small object anyway (not much bigger
than a pointer), but I guess someone might eventually want to make it
into a flexible sized object, or something, I dunno.

-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+ForgetRelationSyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
 {
-   if (pendingOpsTable)
+   FileTag tag;
+
+   tag.rnode = rnode;
+   tag.forknum = forknum;
+   tag.segno = segno;
+
+   if (IsSyncManagedLocally())
{
		/* standalone backend or startup process: fsync state is local */
-   RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
+   RememberSyncRequest(tag, FORGET_REQUEST, SYNC_MD);
}
...

You left this and similar functions in md.c, but I think it needs to
move out to sync.c, and just take a FileTag directly.  Otherwise I
have to write similar functions in undofile.c, and it seems kinda
weird that those modules are worrying about whether sync is managed
locally or the message needs to be sent to the checkpointer, and even
worse, they have to duplicate the loop that deals with
ForwardSyncRequest() failing and retrying.  Concretely I'm saying that
sync.c should define a function like this:

+/*
+ * PostSyncRequest
+ *
+ * Remember locally, or post to checkpointer as appropriate.
+ */
+void
+PostSyncRequest(FileTag tag, SyncRequestType type, SyncRequestOwner owner)
+{
+   if (IsSyncManagedLocally())
+   {
+   /* standalone backend or startup process: fsync state is local */
+   RememberSyncRequest(tag, type, owner);
+   }
+   else if (IsUnderPostmaster)
+   {
+   while (!ForwardSyncRequest(tag, FORGET_REQUEST, SYNC_MD))
+   pg_usleep(10000L);  /* 10 msec seems a good number */
+   }
+}

Hmm, perhaps it would need to take an argument to say whether it
should keep retrying or return false if it fails; that way
register_dirty_segment() could perform the FileSync() itself if the
queue is full, but register_unlink could tell it to keep trying.  Does
this make sense?
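Sketching that variant out (the extra parameter and its name are just
assumptions layered on the code above):

static bool
PostSyncRequest(FileTag tag, SyncRequestType type, SyncRequestOwner owner,
				bool retry)
{
	if (IsSyncManagedLocally())
	{
		/* standalone backend or startup process: fsync state is local */
		RememberSyncRequest(tag, type, owner);
		return true;
	}

	while (!ForwardSyncRequest(tag, type, owner))
	{
		if (!retry)
			return false;		/* e.g. register_dirty_segment() could do the
								 * FileSync() itself when the queue is full */
		pg_usleep(10000L);		/* 10 msec seems a good number */
	}
	return true;
}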

+typedef enum syncrequesttype
+{
+   SYNC_REQUEST,
+   FORGET_REQUEST,
+   FORGET_HIERARCHY_REQUEST,
+   UNLINK_REQUEST
+} syncrequesttype;

These names are too generic for the global C namespace; how about
SYNC_REQ_CANCEL or similar?

+typedef enum syncrequestowner
+{
+   SYNC_MD = 0 /* md smgr */
+} syncrequestowner;

I have a feeling that Andres wanted to see a single enum combining
both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD,
SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you
have it.

+/* md callback f

Re: Unneeded parallel safety tests in grouping_planner

2019-03-01 Thread Etsuro Fujita

(2019/02/28 0:46), Robert Haas wrote:

On Wed, Feb 27, 2019 at 7:46 AM Etsuro Fujita
  wrote:

Yet another thing I noticed while working on [1] is this in
grouping_planner:

     /*
      * If the input rel is marked consider_parallel and there's nothing that's
      * not parallel-safe in the LIMIT clause, then the final_rel can be marked
      * consider_parallel as well.  Note that if the query has rowMarks or is
      * not a SELECT, consider_parallel will be false for every relation in the
      * query.
      */
     if (current_rel->consider_parallel &&
         is_parallel_safe(root, parse->limitOffset) &&
         is_parallel_safe(root, parse->limitCount))
         final_rel->consider_parallel = true;

If there is a need to add a LIMIT node, we don't consider generating
partial paths for the final relation below (see commit
0927d2f46ddd4cf7d6bf2cc84b3be923e0aedc52), so it seems unnecessary
anymore to assess the parallel-safety of the LIMIT and OFFSET clauses.
To save cycles, why not remove those tests from that function like the
attached?


Because in the future we might want to consider generating
partial_paths in cases where we don't do so today.

I repeatedly made the mistake of believing that I could not bother
setting consider_parallel entirely correctly for one reason or
another, and I've gone through multiple iterations of fixing cases
where I did that and it turned out to cause problems.  I now believe
that we should try to get it right in every case, whether or not we
currently think it's possible for it to matter.  Sometimes it matters
in ways that aren't obvious, and it complicates further development.

I don't think we'd save much by changing this test anyway.  Those
is_parallel_safe() tests aren't entirely free, of course, but they
should be very cheap.


I got the point.  Thanks for the explanation!

Best regards,
Etsuro Fujita




Re: Oddity with parallel safety test for scan/join target in grouping_planner

2019-03-01 Thread Etsuro Fujita

(2019/02/28 0:52), Robert Haas wrote:

On Tue, Feb 26, 2019 at 7:26 AM Etsuro Fujita
  wrote:

The parallel safety of the final scan/join target is determined from the
grouping target, not that target, which [ is wrong ]



Your patch looks right to me.


I think it would be better for you to take this one.  Could you?

Best regards,
Etsuro Fujita




Re: Add exclusive backup deprecation notes to documentation

2019-03-01 Thread Martín Marqués
On Fri, Mar 1, 2019 at 06:21, David Steele
(da...@pgmasters.net) wrote:
>
> I added some extra language to the warning that gets emitted in the log.
>   Users are more likely to see that than the documentation.
>
> I also addressed a comment from another thread by adding pg_basebackup
> as an "e.g." rather than an "or".

More awake, I gave this last patch a second read. Wording is good now.
No objections there at all.

I do think that paragraphs 2 and 3 should be merged, as it seems the
idea on the third is a continuation of what's in the second.

But even without that change, I believe this patch is good for commit.

Regards,

-- 
Martín Marqués    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



RE: SQL statement PREPARE does not work in ECPG

2019-03-01 Thread Matsumura, Ryo
Meskes-san

Thank you for your comment.

> The only way to add this is by creating a replacement production for
> this rule. parse.pl cannot do it itself.

I will read README.parser, ecpg.addons, and *.pl to understand.

> > I will use @1 instead of $$1 because the replacement is almost the same as
> > the existing replacing function in ecpglib.
> > Is it good?
> 
> I'd say so.

I will try it.

Regards
Ryo Matsumura


Question about commit 11cf92f6e2e13c0a6e3f98be3e629e6bd90b74d5

2019-03-01 Thread Etsuro Fujita
Hi,

Robert, I CCed you because you are the author of that commit.  Before
that commit ("Rewrite the code that applies scan/join targets to
paths."), apply_scanjoin_target_to_paths() had a boolean parameter named
modify_in_place, and used apply_projection_to_path(), not
create_projection_path(), to adjust scan/join paths when modify_in_place
was true.  That allowed us to save cycles at plan creation time by
avoiding creating projection paths, which I think is a good thing,
but that commit removed it.  Why?

The real reason for this question is: I noticed that projection paths
put on foreign paths will make it hard for FDWs to detect whether there
is an already-well-enough-sorted remote path in the pathlist for the
final scan/join relation as an input relation to GetForeignUpperPaths()
for the UPPERREL_ORDERED step (the IsA(path, ForeignPath) test would not
work well enough to detect remote paths!).  So I'm wondering whether we
could revive that parameter, as in the attached, to avoid the overhead at
plan creation time and to make the FDW's work easy.  Maybe I'm missing
something, though.

Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 2815,2821  select sum(c1) from ft1 group by c2 having avg(c1 * (random() <= 1)::int) > 100
   Group Key: ft1.c2
   Filter: (avg((ft1.c1 * ((random() <= '1'::double precision))::integer)) > '100'::numeric)
   ->  Foreign Scan on public.ft1
!Output: c1, c2
 Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1"
  (10 rows)
  
--- 2815,2821 
   Group Key: ft1.c2
   Filter: (avg((ft1.c1 * ((random() <= '1'::double precision))::integer)) > '100'::numeric)
   ->  Foreign Scan on public.ft1
!Output: c2, c1
 Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1"
  (10 rows)
  
***
*** 3039,3045  select sum(c1) filter (where (c1 / c1) * random() <= 1) from ft1 group by c2 ord
   Output: sum(c1) FILTER (WHERE c1 / c1))::double precision * random()) <= '1'::double precision)), c2
   Group Key: ft1.c2
   ->  Foreign Scan on public.ft1
!Output: c1, c2
 Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1"
  (9 rows)
  
--- 3039,3045 
   Output: sum(c1) FILTER (WHERE c1 / c1))::double precision * random()) <= '1'::double precision)), c2
   Group Key: ft1.c2
   ->  Foreign Scan on public.ft1
!Output: c2, c1
 Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1"
  (9 rows)
  
***
*** 3213,3219  select array_agg(c1 order by c1 using operator(public.<^)) from ft2 where c2 = 6
 Output: array_agg(c1 ORDER BY c1 USING <^ NULLS LAST), c2
 Group Key: ft2.c2
 ->  Foreign Scan on public.ft2
!  Output: c1, c2
   Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE (("C 1" < 100)) AND ((c2 = 6))
  (6 rows)
  
--- 3213,3219 
 Output: array_agg(c1 ORDER BY c1 USING <^ NULLS LAST), c2
 Group Key: ft2.c2
 ->  Foreign Scan on public.ft2
!  Output: c2, c1
   Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE (("C 1" < 100)) AND ((c2 = 6))
  (6 rows)
  
***
*** 3261,3267  select array_agg(c1 order by c1 using operator(public.<^)) from ft2 where c2 = 6
 Output: array_agg(c1 ORDER BY c1 USING <^ NULLS LAST), c2
 Group Key: ft2.c2
 ->  Foreign Scan on public.ft2
!  Output: c1, c2
   Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE (("C 1" < 100)) AND ((c2 = 6))
  (6 rows)
  
--- 3261,3267 
 Output: array_agg(c1 ORDER BY c1 USING <^ NULLS LAST), c2
 Group Key: ft2.c2
 ->  Foreign Scan on public.ft2
!  Output: c2, c1
   Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE (("C 1" < 100)) AND ((c2 = 6))
  (6 rows)
  
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
***
*** 237,243  static void apply_scanjoin_target_to_paths(PlannerInfo *root,
  			   List *scanjoin_targets,
  			   List *scanjoin_targets_contain_srfs,
  			   bool scanjoin_target_parallel_safe,
! 			   bool tlist_same_exprs);
  static void create_partitionwise_grouping_paths(PlannerInfo *root,
  	RelOptInfo *input_rel,
  	RelOptInfo *grouped_rel,
--- 237,244 
  			   List *scanjoin_targets,
  			   List *scanjoin_targets_contain_srfs,
  			   bool scanjoin_target_parallel_safe,
! 			   bool tlist_same_exprs,
! 			   bool modify_in_place);
  static void create_partitionwise_grouping_paths(PlannerInfo *root,
  	RelOptInfo *input_rel,
  	RelOptInfo *grouped_rel,
***
*** 2046,2052  grouping_planner(PlannerInfo *root, bool inheritance_update,
  		apply_scanjoin_target_to_paths(root, current_rel, scanjoin_targets,
  	   scanjoi

Re: Problems with plan estimates in postgres_fdw

2019-03-01 Thread Antonin Houska
Etsuro Fujita  wrote:

> (2019/02/22 22:54), Antonin Houska wrote:
> > Etsuro Fujita  wrote:
> >> On the other hand, the code bit added by
> >> 0001-postgres_fdw-Perform-UPPERREL_ORDERED-step-remotely-v3.patch handles 
> >> the
> >> case where a post-scan/join processing step other than grouping/aggregation
> >> (ie, the final sort or LIMIT/OFFSET) is performed directly on a base or 
> >> join
> >> relation, as in the LIMIT example.  So, the two changes are handling 
> >> different
> >> cases, hence both changes would be required.
> >
> > Do you mean this part?
> 
> No, I don't.  What I meant was this part:
> 
> +   /*
> +* If this is an UPPERREL_ORDERED step performed on the final
> +* scan/join relation, the costs obtained from the cache wouldn't yet
> +* contain the eval costs for the final scan/join target, which would
> +* have been updated by apply_scanjoin_target_to_paths(); add the eval
> +* costs now.
> +*/
> +   if (fpextra && !IS_UPPER_REL(foreignrel))
> +   {
> +   /* The costs should have been obtained from the cache. */
> +   Assert(fpinfo->rel_startup_cost >= 0 &&
> +  fpinfo->rel_total_cost >= 0);
> +
> +   startup_cost += foreignrel->reltarget->cost.startup;
> +   run_cost += foreignrel->reltarget->cost.per_tuple * rows;
> +   }

> Yeah, but I think the code bit cited above is needed.  Let me explain using
> yet another example with grouping and ordering:
> 
> SELECT a+b, random() FROM foreign_table GROUP BY a+b ORDER BY a+b;
> 
> For this query, the sort_input_target would be {a+b}, (to which the
> foreignrel->reltarget would have been set when called from
> estimate_path_cost_size() called from postgresGetForeignUpperPaths() with the
> UPPERREL_ORDERED stage), but the final target would be {a+b, random()}, so the
> sort_input_target isn't the same as the final target in that case; the costs
> needs to be adjusted so that the costs include the ones of generating the
> final target.  That's the reason why I added the code bit you cited.

I used gdb to help me understand, however the condition

if (fpextra && !IS_UPPER_REL(foreignrel))

never evaluated to true with the query above. fpextra was only !=NULL when
estimate_path_cost_size() was called from add_foreign_ordered_paths(), but at
the same time foreignrel->reloptkind was RELOPT_UPPER_REL.

It's still unclear to me why add_foreign_ordered_paths() passes the input
relation (input_rel) to estimate_path_cost_size(). If it passed the output rel
(i.e. ordered_rel in this case) like add_foreign_grouping_paths() does, then
foreignrel->reltarget should contain the random() function. (What I see now is
that create_ordered_paths() leaves ordered_rel->reltarget empty, but that's
another problem to be fixed.)

Do you still see a reason to call estimate_path_cost_size() this way?

> > Note about the !activeWindows test: It occurs to me now that no paths should be
> > added to the UPPERREL_ORDERED relation if the query needs a WindowAgg node, 
> > because
> > postgres_fdw currently cannot handle UPPERREL_WINDOW relation. Or rather in
> > general, the FDW should not skip any stage just because it doesn't know how 
> > to
> > build the paths.
> 
> That's right.  In postgres_fdw, by the following code in
> postgresGetForeignUpperPaths(), we will give up on doing the UPPERREL_ORDERED
> step remotely, if the query has the UPPERREL_WINDOW (and/or UPPERREL_DISTINCT)
> steps, because in that case we would have input_rel->fdw_private=NULL for a
> call to postgresGetForeignUpperPaths() with the UPPERREL_ORDERED stage:
> 
> /*
>  * If input rel is not safe to pushdown, then simply return as we cannot
>  * perform any post-join operations on the foreign server.
>  */
> if (!input_rel->fdw_private ||
> !((PgFdwRelationInfo *) input_rel->fdw_private)->pushdown_safe)
> return;

I understand now: if FDW did not handle the previous stage, then
input_rel->fdw_private is NULL.

-- 
Antonin Houska
https://www.cybertec-postgresql.com



Re: Add exclusive backup deprecation notes to documentation

2019-03-01 Thread Peter Eisentraut
Please follow the 1-space indentation in the documentation files.

I think the style of mentioning all the problems in a note before the
actual description is a bit backwards.

The layout of the section should be:

- This is what it does.

- Here are some comparisons with other methods.

- For this or that reason, it's deprecated.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Problems with plan estimates in postgres_fdw

2019-03-01 Thread Antonin Houska
Etsuro Fujita  wrote:

> (2019/02/23 0:21), Antonin Houska wrote:
> > Etsuro Fujita  wrote:
> 
> >>> (2019/02/08 2:04), Antonin Houska wrote:
>  * regression tests: I think test(s) should be added for queries that have
>  ORDER BY clause but do not have GROUP BY (and also no LIMIT / OFFSET)
>  clause. I haven't noticed such tests.
> 
> >> I noticed that such queries would be processed by what we already have for
> >> sort pushdown (ie, add_paths_with_pathkeys_for_rel()).  I might be missing
> >> something, though.
> >
> > What about an ORDER BY expression that contains multiple Var nodes? For
> > example
> >
> > SELECT * FROM foo ORDER BY x + y;
> >
> > I think the base relation should not be able to generate paths with the
> > corresponding pathkeys, since its target should (besides PlaceHolderVars,
> > which should not appear in the plan of this simple query at all) only emit
> > individual Vars.
> 
> Actually, add_paths_with_pathkeys_for_rel() generates such pre-sorted paths
> for the base relation, as shown in the below example using HEAD without the
> patchset proposed in this thread:
> 
> postgres=# explain verbose select a+b from ft1 order by a+b;
> QUERY PLAN
> --
>  Foreign Scan on public.ft1  (cost=100.00..200.32 rows=2560 width=4)
>Output: (a + b)
>Remote SQL: SELECT a, b FROM public.t1 ORDER BY (a + b) ASC NULLS LAST
> (3 rows)
> 
> I think it is OK for that function to generate such paths, as tlists for such
> paths would be adjusted in apply_scanjoin_target_to_paths(), by doing
> create_projection_path() to them.

I've just checked it. The FDW constructs the (a + b) expression for the ORDER
BY clause on its own. It only needs to find the input vars in the relation
target.

> Conversely, it appears that add_foreign_ordered_paths() added by the patchset
> would generate such pre-sorted paths *redundantly* when the input_rel is the
> final scan/join relation.  Will look into that.

Currently I have no idea how to check the plan created by FDW at the
UPPERREL_ORDERED stage, except for removing the sort from the
UPPERREL_GROUP_AGG stage as I proposed here:

https://www.postgresql.org/message-id/11807.1549564431%40localhost

-- 
Antonin Houska
https://www.cybertec-postgresql.com



Re: pg_dump/pg_restore fail for TAR_DUMP and CUSTOM_DUMP from v94/v95/v96 to v11/master.

2019-03-01 Thread Suraj Kharage
Hi,

Commit 5955d934194c3888f30318209ade71b53d29777f changed the logic
to avoid dumping creation and comment commands for the public schema.
From v11 onwards, we are using the DUMP_COMPONENT_ infrastructure in
selectDumpableNamespace() to skip the public schema creation.

As reported by Prabhat, if we try to restore a custom/tar dump taken from
v10 or an earlier version, we get the reported error for the public schema.
The reason for this error is that a custom/tar dump taken from v10 or an
earlier version contains a "CREATE SCHEMA public;" statement, and v11 fails to
bypass it as per the current logic.

The plain format does not produce the error in this case, because in all
versions, pg_dump in plain format does not generate that "CREATE SCHEMA
public". In v10 and earlier, we filter out that public schema creation in
_printTocEntry() during pg_dump.

In custom/tar format, pg_dump in v10 and earlier versions generates the
schema creation statement for the public schema, but then during pg_restore on
the same or earlier branches it gets skipped by the same _printTocEntry()
function.

I think we can add logic in -
1) BuildArchiveDependencies() to avoid dumping creation and comment
commands for the public schema, since we do not have the DUMP_COMPONENT_
infrastructure in all supported back-branches.
or
2) dumpNamespace() to not include the public schema creation.

Thoughts?

Regards,
Suraj


Looks heap_create_with_catalog ignored the if_not_exists options

2019-03-01 Thread Andy Fan
For a CreateStmt, it will call transformCreateStmt and then
heap_create_with_catalog,
but it looks like only transformCreateStmt checks if_not_exists.

So there is a chance that when transformCreateStmt is called, the table
is not yet created, but before heap_create_with_catalog is called, the
table gets created.  If so, "if not exists" will raise the error "ERROR:
relation "" already exists"

I can reproduce this with gdb,

demo=# create table if not exists 2 (a int);
ERROR:  relation "2" already exists

Is it designed this way on purpose, or is it a bug?

I am using the latest commit on GitHub now.


pg_background and BGWH_STOPPED

2019-03-01 Thread Švorc Martin
Hello,

We probably identified a bug in the pg_background implementation: 
https://github.com/vibhorkum/pg_background 
It is a race condition when starting the process and BGWH_STOPPED is returned - 
see the pull request for more info: 
https://github.com/RekGRpth/pg_background/pull/1 

I think I have created a fix (in one of the forks of the original repo, 
https://github.com/RekGRpth/pg_background, which already addresses some 
compilation issues), but then again I am not very familiar with the PG API and 
would very much appreciate it if anyone could review the bug and approve the 
solution.

Regards

Martin

-Original Message-
From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of amul sul
Sent: Thursday, November 24, 2016 4:47 AM
To: PostgreSQL-development
Subject: pg_background contrib module proposal

Hi All,

I would like to take over the pg_background patch and repost it for discussion and 
review.

Initially Robert Haas shared this as a parallelism demonstration[1] and 
later abandoned it with a summary of the open issues[2] with this pg_background patch 
that need to be fixed; most of them seem to be addressed in core, except handling of 
types that exist without binary send/recv functions, and documentation.
I have added handling for types that don't have binary send/recv functions in 
the attached patch and will work on documentation at the end.

One concern with this patch is code duplication with exec_simple_query(); we 
could consider Jim Nasby's patch[3] to overcome this, but we would certainly 
end up complicating
exec_simple_query() to make pg_background happy.

As discussed previously[1], pg_background is a contrib module that lets you 
launch arbitrary commands in a background worker.

• VACUUM in the background
• Autonomous transaction implementation better than the dblink way (i.e.
no separate authentication required).
• Allows performing tasks like CREATE INDEX CONCURRENTLY from a procedural 
language.

This module comes with the following SQL APIs:

• pg_background_launch : This API takes the SQL command the user wants to 
execute, and the size of the queue buffer.
  This function returns the process id of the background worker.
• pg_background_result   : This API takes the process id as an input
parameter and returns the result of the command
  executed through the background worker.
• pg_background_detach : This API takes the process id and detaches the 
background process, which is waiting for the user to read its results.


Here's an example of running vacuum and then fetching the results.
Notice that the
notices from the original session are propagated to our session; if an error 
had occurred, it would be re-thrown locally when we try to read the results.

postgres=# create table foo (a int);
CREATE TABLE
postgres=# insert into foo values(generate_series(1,5)); INSERT 0 5

postgres=# select pg_background_launch('vacuum verbose foo'); 
pg_background_launch
--
  65427
(1 row)

postgres=# select * from pg_background_result(65427) as (x text);
INFO:  vacuuming "public.foo"
INFO:  "foo": found 0 removable, 5 nonremovable row versions in 1 out of 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
 x

VACUUM
(1 row)


Thanks to Vibhor kumar, Rushabh Lathia and Robert Haas for feedback.

Please let me know your thoughts, and thanks for reading.

[1]. 
https://www.postgresql.org/message-id/CA%2BTgmoam66dTzCP8N2cRcS6S6dBMFX%2BJMba%2BmDf68H%3DKAkNjPQ%40mail.gmail.com
[2]. 
https://www.postgresql.org/message-id/CA%2BTgmobPiT_3Qgjeh3_v%2B8Cq2nMczkPyAYernF_7_W9a-6T1PA%40mail.gmail.com
[3]. https://www.postgresql.org/message-id/54541779.1010906%40BlueTreble.com

Regards,
Amul


Re: SQL statement PREPARE does not work in ECPG

2019-03-01 Thread Michael Meskes
Hi Matsumura-san,

> I will read README.parser, ecpg.addons, and *.pl to understand.

Feel free to ask, when anything comes up.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL




Re: House style for DocBook documentation?

2019-03-01 Thread Ramanarayana
Hi,
I have tested the bug fixes provided by all the patches. They are working
great. I found one minor issue:

 select * from xmltable('*' PASSING 'pre&deeppost' COLUMNS x XML PATH '/');

The above query returns the XML, but there is an extra plus symbol at the
end:

pre&deeppost+

Regards,
Ram


RE: SQL statement PREPARE does not work in ECPG

2019-03-01 Thread Matsumura, Ryo
Hi Meskes-san

I must use a mid-rule action like the following, which works as expected.
I wonder how to write the replacement in ecpg.addons.
I think it's impossible, right? Please give me some advice.

 PrepareStmt:
PREPARE prepared_name prep_type_clause AS
{
prepared_name = $2;
prepare_type_clause = $3;
is_in_preparable_stmt = true;
}
PreparableStmt
{
$$.name = prepared_name;
$$.type = prepare_type_clause;
$$.stmt = $6;
is_in_preparable_stmt = false;
}
| PREPARE prepared_name FROM execstring

Regards
Ryo Matsumura

> -Original Message-
> From: Michael Meskes [mailto:mes...@postgresql.org]
> Sent: Friday, March 1, 2019 8:42 PM
> To: Matsumura, Ryo/松村 量 ; Takahashi,
> Ryohei/高橋 良平 ;
> 'pgsql-hack...@postgresql.org' 
> Subject: Re: SQL statement PREPARE does not work in ECPG
> 
> Hi Matsumura-san,
> 
> > I will read README.parser, ecpg.addons, and *.pl to understand.
> 
> Feel free to ask, when anything comes up.
> 
> Michael
> --
> Michael Meskes
> Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
> Meskes at (Debian|Postgresql) dot Org
> Jabber: michael at xmpp dot meskes dot org
> VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL
> 



Re: using index or check in ALTER TABLE SET NOT NULL

2019-03-01 Thread David Rowley
On Sun, 15 Apr 2018 at 19:09, Sergei Kornilov  wrote:
> Attached updated patch follows recent Reorganize partitioning code commit.

I've made a pass over v10. I think it's in pretty good shape, but I
did end up changing a few small things.

1. Tweaked the documents a bit. I was going to just change "Full table
scan" to "A full table scan", but I ended up rewriting the entire
paragraph.
2. In ATRewriteTables, updated comment to mention "If required" since
the scan may not be required if SET NOT NULLs are already proved okay.
3. Updated another forgotten comment in ATRewriteTables.
4. Removed mention about table's with OIDs in ATExecAddColumn(). This
seems left over from 578b229718.
5. Changed ATExecSetNotNull not to check for matching constraints if
we're already going to make a pass over the table. Setting
verify_new_notnull to true again does not make it any more true. :)
6. Tweaked NotNullImpliedByRelConstraints() a bit. You were calling
lappend on an empty list. list_make1() is fine. Don't need a variable
for that, just pass it to the function.
7. Tightened up the comments in ConstraintImpliedByRelConstraint() to
mention that it only supports immutable quals containing vars with
varno=1. Removed the comment near the bottom that mentioned that in
passing.

The only thing that I'm a bit unsure of is the tests. I've read the
thread and I see the discussion above about it.  I'd personally have
thought INFO was fine since ATTACH PARTITION does that, but I see
objections.  It appears that all the tests just assume that the CHECK
constraint was used to validate the SET NOT NULL. I'm not all that
sure if there's any good reason not to set client_min_messages =
'debug1' just before your test then RESET client_min_messages;
afterwards.  No other tests seem to do it, but my only thoughts on the
reason not to would be that it might fail if someone added another
debug somewhere, but so what? they should update the expected results
too.

Oh, one other thing I don't really like is that
ConstraintImpliedByRelConstraint() adds all of the other column's IS
NOT NULL constraints to the existConstraint. This is always wasted
effort for the SET NOT NULL use case.  I wonder if a new flag should
be added to control this, or even better, separate that part out and
put back the function named PartConstraintImpliedByRelConstraint and
have it do the IS NOT NULL building before calling
ConstraintImpliedByRelConstraint(). No duplicate code that way and you
also don't need to touch partbound.c
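Roughly this shape, I think (only a sketch: the wrapper's signature is
borrowed from the existing function, and the three-argument form of
ConstraintImpliedByRelConstraint() is an assumption about how the split
might look):

bool
PartConstraintImpliedByRelConstraint(Relation scanrel, List *partConstraint)
{
	List	   *existConstraint = NIL;
	TupleDesc	tupdesc = RelationGetDescr(scanrel);
	int			i;

	/*
	 * Partitioning wants the IS NOT NULL facts added up front; the SET NOT
	 * NULL caller would call ConstraintImpliedByRelConstraint() directly
	 * and skip this loop entirely.
	 */
	for (i = 0; i < tupdesc->natts; i++)
	{
		Form_pg_attribute att = TupleDescAttr(tupdesc, i);

		if (att->attnotnull && !att->attisdropped)
		{
			NullTest   *ntest = makeNode(NullTest);

			ntest->arg = (Expr *) makeVar(1, i + 1, att->atttypid,
										  att->atttypmod, att->attcollation, 0);
			ntest->nulltesttype = IS_NOT_NULL;
			ntest->argisrow = false;
			ntest->location = -1;
			existConstraint = lappend(existConstraint, ntest);
		}
	}

	/* assumed three-argument form; adjust to whatever the patch ends up with */
	return ConstraintImpliedByRelConstraint(scanrel, partConstraint,
											existConstraint);
}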

Setting this on waiting on author.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


alter_table_set_not_null_by_constraints_only_v10_fixes.patch
Description: Binary data


Re: Looks heap_create_with_catalog ignored the if_not_exists options

2019-03-01 Thread Michael Paquier
On Fri, Mar 01, 2019 at 07:17:04PM +0800, Andy Fan wrote:
> for a createStmt,  it will call transformCreateStmt,  and then
> heap_create_with_catalog.
> but looks it just check the if_not_exists in transformCreateStmt.
> 
> is it designed as this on purpose or is it a bug?

That's a bug.  Andreas Karlsson and I have been discussing it a couple
of days ago actually:
https://www.postgresql.org/message-id/20190215081451.gd2...@paquier.xyz

Fixing this is not as straightforward as it seems, as it requires
shuffling the code related to CTAS creation a bit so that all code
paths check at the same time for an existing relation.  Based on my
first impressions, I got the feeling that it would be rather invasive
and not worth a back-patch.
--
Michael


signature.asc
Description: PGP signature


Re: Problems with plan estimates in postgres_fdw

2019-03-01 Thread Etsuro Fujita

(2019/03/01 20:00), Antonin Houska wrote:

Etsuro Fujita  wrote:

(2019/02/22 22:54), Antonin Houska wrote:

Etsuro Fujita   wrote:

So, the two changes are handling different
cases, hence both changes would be required.



+   /*
+    * If this is an UPPERREL_ORDERED step performed on the final
+    * scan/join relation, the costs obtained from the cache wouldn't yet
+    * contain the eval costs for the final scan/join target, which would
+    * have been updated by apply_scanjoin_target_to_paths(); add the eval
+    * costs now.
+    */
+   if (fpextra && !IS_UPPER_REL(foreignrel))
+   {
+       /* The costs should have been obtained from the cache. */
+       Assert(fpinfo->rel_startup_cost >= 0 &&
+              fpinfo->rel_total_cost >= 0);
+
+       startup_cost += foreignrel->reltarget->cost.startup;
+       run_cost += foreignrel->reltarget->cost.per_tuple * rows;
+   }



Yeah, but I think the code bit cited above is needed.  Let me explain using
yet another example with grouping and ordering:

 SELECT a+b, random() FROM foreign_table GROUP BY a+b ORDER BY a+b;

For this query, the sort_input_target would be {a+b}, (to which the
foreignrel->reltarget would have been set when called from
estimate_path_cost_size() called from postgresGetForeignUpperPaths() with the
UPPERREL_ORDERED stage), but the final target would be {a+b, random()}, so the
sort_input_target isn't the same as the final target in that case; the costs
needs to be adjusted so that the costs include the ones of generating the
final target.  That's the reason why I added the code bit you cited.


I used gdb to help me understand, however the condition

if (fpextra && !IS_UPPER_REL(foreignrel))

never evaluated to true with the query above.


Sorry, my explanation was not enough again, but I showed that query 
("SELECT a+b, random() FROM foreign_table GROUP BY a+b ORDER BY a+b;") 
to explain why the following code bit is needed:


+   /*
+    * If this includes an UPPERREL_ORDERED step, the given target, which
+    * would be the final target to be applied to the resulting path, might
+    * have different expressions from the underlying relation's reltarget
+    * (see make_sort_input_target()); adjust tlist eval costs.
+    */
+   if (fpextra && fpextra->target != foreignrel->reltarget)
+   {
+       QualCost    oldcost = foreignrel->reltarget->cost;
+       QualCost    newcost = fpextra->target->cost;
+
+       startup_cost += newcost.startup - oldcost.startup;
+       total_cost += newcost.startup - oldcost.startup;
+       total_cost += (newcost.per_tuple - oldcost.per_tuple) * rows;
+   }

I think that the condition

if (fpextra && fpextra->target != foreignrel->reltarget)

would be evaluated to true for that query.


It's still unclear to me why add_foreign_ordered_paths() passes the input
relation (input_rel) to estimate_path_cost_size(). If it passed the output rel
(i.e. ordered_rel in this case) like add_foreign_grouping_paths() does, then
foreignrel->reltarget should contain the random() function. (What I see now is
that create_ordered_paths() leaves ordered_rel->reltarget empty, but that's
another problem to be fixed.)

Do you still see a reason to call estimate_path_cost_size() this way?


Yeah, the reason for that is that we can estimate the costs of 
sorting for the final scan/join relation using, as-is, the existing code 
that estimates the costs of sorting for base or join relations, except 
for tlist eval cost adjustment.  As you mentioned, we could pass 
ordered_rel to estimate_path_cost_size(), but if so, I think we would 
need to get input_rel (ie, the final scan/join relation) from 
ordered_rel within estimate_path_cost_size(), to use that code, which 
would not be great.


Best regards,
Etsuro Fujita




Re: FETCH FIRST clause PERCENT option

2019-03-01 Thread Surafel Temesgen
On Fri, Mar 1, 2019 at 4:33 AM Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> Hello.
>
> At Thu, 28 Feb 2019 21:16:25 +0100, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote in <
> fbd08ad3-5dd8-3169-6cba-38d610d7b...@2ndquadrant.com>
> > > One biggest issue seems to be we don't know the total number of
>
> # One *of* the biggest *issues*?
>
> > > outer tuples before actually reading a null tuple. I doubt of
> > > general shortcut for that. It also seems preventing limit node
> > > from just using materialized outer.
> > >
> >
> > Sure, if you actually want all tuples, you'll have to execute the outer
> > plan till completion. But that's not what I'm talking about - what if we
> > only ever need to read one row from the limit?
>
> We have no choice than once reading all tuples just to find we
> are to return just one row, since estimator is not guaranteed to
> be exact as required for this purpose.
>
> > To give you a (admittedly, somewhat contrived and artificial example):
> >
> > SELECT * FROM t1 WHERE id IN (
> >   SELECT id FROM t2 ORDER BY x FETCH FIRST 10 PERCENT ROWS ONLY
> > );
> >
> > Maybe this example is bogus and/or does not really matter in practice. I
> > don't know, but I've been unable to convince myself that's the case.
>
> I see such kind of idiom common. Even in the quite simple example
> above, *we* cannot tell how many tuples the inner should return
> unless we actually fetch all tuples in t2. This is the same
> problem with count(*).
>
> The query is equivalent to the following one.
>
>  SELECT * FROM t1 WHERE id IN (
>SELECT id FROM t2 ORDER BY x
>  FETCH FIRST (SELECT ceil(count(*) * 0.1) FROM t2) ROWS ONLY
>  );
>
> This scans t2 twice, but this patch does only one full scan
> moving another partial scan to tuplestore. We would win if the
> outer is complex enough.
>

Okay, here is the previous implementation with the upthread review comments
included; it also considers the OFFSET clause in the percentage calculation.

regards
Surafel
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 06d611b64c..e3ce4d7e36 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -44,7 +44,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionexpression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
 [ LIMIT { count | ALL } ]
 [ OFFSET start [ ROW | ROWS ] ]
-[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
+[ FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY ]
 [ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ]
 
 where from_item can be one of:
@@ -1430,7 +1430,7 @@ OFFSET start
 which PostgreSQL also supports.  It is:
 
 OFFSET start { ROW | ROWS }
-FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY
+FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY
 
 In this syntax, the start
 or count value is required by
@@ -1440,7 +1440,8 @@ FETCH { FIRST | NEXT } [ count ] {
 ambiguity.
 If count is
 omitted in a FETCH clause, it defaults to 1.
-ROW
+With PERCENT, count specifies the maximum number of rows to return
+as a percentage of the total number of rows. ROW
 and ROWS as well as FIRST
 and NEXT are noise words that don't influence
 the effects of these clauses.
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index baa669abe8..2be6f655fb 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -43,6 +43,7 @@ ExecLimit(PlanState *pstate)
 	LimitState *node = castNode(LimitState, pstate);
 	ScanDirection direction;
 	TupleTableSlot *slot;
+	TupleDesc   tupleDescriptor;
 	PlanState  *outerPlan;
 
 	CHECK_FOR_INTERRUPTS();
@@ -52,6 +53,8 @@ ExecLimit(PlanState *pstate)
 	 */
 	direction = node->ps.state->es_direction;
 	outerPlan = outerPlanState(node);
+	slot = node->subSlot;
+	tupleDescriptor = node->ps.ps_ResultTupleDesc;
 
 	/*
 	 * The main logic is a simple state machine.
@@ -60,6 +63,26 @@ ExecLimit(PlanState *pstate)
 	{
 		case LIMIT_INITIAL:
 
+			if (node->limitOption == PERCENTAGE)
+			{
+
+/*
+ * With the PERCENTAGE option the number of tuples to return can't
+ * be determined before receiving the last tuple, so execute
+ * outerPlan until the end and store the result tuples to avoid
+ * executing it twice
+ */
+for (;;)
+{
+	slot = ExecProcNode(outerPlan);
+	if (TupIsNull(slot))
+	{
+		break;
+	}
+	tuplestore_puttupleslot(node->tuple_store, slot);
+}
+			}
+
 			/*
 			 * First call for this node, so compute limit/offset. (We can't do
 			 * this any earlier, because parameters from upper nodes will not
@@ -87,24 +110,47 @@ ExecLimit(PlanState *pstate)
 return NULL;
 			}
 
-			/*
-			 * Fetch rows from subplan until we reach position > offset.
-			 */
-			for (;;)
+			if (node->limitOption == PERCENTAGE)
 			{
-slot = ExecProcNode(outerPlan);
-	

Re: NOT IN subquery optimization

2019-03-01 Thread David Rowley
On Fri, 1 Mar 2019 at 15:27, Richard Guo  wrote:
> I have reviewed your patch. Good job except two issues I can find:
>
> 1. The patch would give wrong results when the inner side is empty. In this
> case, the whole data from outer side should be in the outputs. But with the
> patch, we will lose the NULLs from outer side.
>
> 2. Because of the new added predicate 'OR (var is NULL)', we cannot use hash
> join or merge join to do the ANTI JOIN.  Nested loop becomes the only choice,
> which is low-efficency.

Yeah. Both of these seem pretty fundamental, so setting the patch to
waiting on author.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: SQL statement PREPARE does not work in ECPG

2019-03-01 Thread Michael Meskes
Hi Matsumura-san,

> I must use a midrule action like the following that works as
> expected.
> I wonder how to write the replacement to ecpg.addons.
> I think it's impossible, right? Please give me some advice.

You are right, for this change you have to replace the whole rule. This
cannot be done with an addon. Please see the attached for a way to do
this, untested though.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL
diff --git a/src/interfaces/ecpg/preproc/ecpg.trailer b/src/interfaces/ecpg/preproc/ecpg.trailer
index 1d58c778d9..be6c9d0ae9 100644
--- a/src/interfaces/ecpg/preproc/ecpg.trailer
+++ b/src/interfaces/ecpg/preproc/ecpg.trailer
@@ -1,5 +1,26 @@
 /* src/interfaces/ecpg/preproc/ecpg.trailer */
 
+PrepareStmt: PREPARE prepared_name prep_type_clause AS
+{
+prepared_name = $2;
+prepare_type_clause = $3;
+is_in_preparable_stmt = true;
+}
+PreparableStmt
+{
+$$.name = prepared_name;
+$$.type = prepare_type_clause;
+$$.stmt = $6;
+is_in_preparable_stmt = false;
+}
+| PREPARE prepared_name FROM execstring
+	{
+$$.name = $2;
+$$.type = NULL;
+$$.stmt = $4;
+}
+	;
+
 statements: /*EMPTY*/
 | statements statement
 		;
diff --git a/src/interfaces/ecpg/preproc/ecpg.type b/src/interfaces/ecpg/preproc/ecpg.type
index 519b737dde..1c68095274 100644
--- a/src/interfaces/ecpg/preproc/ecpg.type
+++ b/src/interfaces/ecpg/preproc/ecpg.type
@@ -145,3 +145,5 @@
 %type var_type
 
 %type   action
+
+%type  PrepareStmt
diff --git a/src/interfaces/ecpg/preproc/parse.pl b/src/interfaces/ecpg/preproc/parse.pl
index 6f67a5e942..cafca22f7a 100644
--- a/src/interfaces/ecpg/preproc/parse.pl
+++ b/src/interfaces/ecpg/preproc/parse.pl
@@ -64,6 +64,7 @@ my %replace_types = (
 	'stmtblock'  => 'ignore',
 	'stmtmulti'  => 'ignore',
 	'CreateAsStmt'   => 'ignore',
+	'PrepareStmt'	 => 'ignore',
 	'DeallocateStmt' => 'ignore',
 	'ColId'  => 'ignore',
 	'type_function_name' => 'ignore',


Re: speeding up planning with partitions

2019-03-01 Thread Amit Langote
Imai-san,

Thanks for testing and sorry it took me a while to reply.

On 2019/02/25 15:24, Imai, Yoshikazu wrote:
> [update_pt_with_joining_another_pt.sql]
> update rt set c = jrt.c + 100 from jrt where rt.b = jrt.b;
> 
> [pgbench]
> pgbench -n -f update_pt_with_joining_another_pt_for_ptkey.sql -T 60 postgres
> 
> [results]
> (part_num_rt, part_num_jrt)  master  patched(0001)
> ---  --  -
>   (3, 1024)8.06   5.89
>   (3, 2048)1.52   0.87
>   (6, 1024)4.11   1.77
> 
> 
> 
> With HEAD, we add target inheritance and source inheritance to parse->rtable 
> in inheritance_planner and copy and adjust it for child tables at beginning 
> of each planning of child tables.
> 
> With the 0001 patch, we add target inheritance to parse->rtable in 
> inheritance_planner and add source inheritance to parse->rtable in 
> make_one_rel(under grouping_planner()) during each planning of child tables.
> Adding source inheritance to parse->rtable may be the same process between 
> each planning of child tables and it might be useless. OTOH, from the POV of 
> making inheritance-expansion-at-bottom better, expanding source inheritance 
> in make_one_rel seems correct design to me.
> 
> How should we do that...?

To solve this problem, I ended up teaching inheritance_planner to reuse
the objects for *source* inheritance child relations (RTEs,
AppendRelInfos, and PlanRowMarks) created during the planning of the 1st
child query for the planning of subsequent child queries.  Set of source
child relations don't change between different planning runs, so it's okay
to do so.  Of course, I had to make sure that query_planner (which is not
in the charge of adding source inheritance child objects) can notice that.

Please find attached updated patches.  Will update source code comments,
commit message and perform other fine-tuning over the weekend if possible.

Thanks,
Amit
From 2f4c085145c3991c8d7160eab0c90d5d3f9d713e Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 26 Oct 2018 16:45:59 +0900
Subject: [PATCH v25 1/2] Lazy creation of RTEs for inheritance children

Rewrite inherit.c so that source inheritance could be expanded
during query_planner().  That's better because for partitioned
tables, we can add only the partitions left after performing
partition pruning.

Since it's not known what RowMark identities are necessary for
individual inheritance children until they're added to the query,
preprocess_targetlist must delay adding junk columns if there are
any parent PlanRowMarks.

For partitioning, although we don't create a RangeTblEntry and
RelOptInfo for pruned partitions at make_one_rel time, partitionwise
join code relies on the fact that even though partitions may have
been pruned, they'd still own a RelOptInfo to handle the outer join
case where the pruned partition appears on the nullable side of join.
Partitionwise join code deals with that by allocating dummy
RelOptInfos for pruned partitions that are based mostly on their
parent's properties.

There are a few regression test diffs:

 1. Caused by the fact that we no longer allocate a duplicate RT
entry for a partitioned table in its role as child, as seen in
the partition_aggregate.out test output.

 2. Those in postgres_fdw.out are caused by the fact that junk columns
required for row marking are added to reltarget->exprs later than
user columns, because the row marking junk columns aren't added
until the inheritance is expanded which as of this commit is
later than it used to be as noted above.
---
 contrib/postgres_fdw/expected/postgres_fdw.out|  24 +-
 src/backend/optimizer/path/allpaths.c | 202 +--
 src/backend/optimizer/path/joinrels.c |  92 +++-
 src/backend/optimizer/plan/initsplan.c|  53 +-
 src/backend/optimizer/plan/planmain.c |  15 +-
 src/backend/optimizer/plan/planner.c  | 271 +++---
 src/backend/optimizer/plan/setrefs.c  |   6 +
 src/backend/optimizer/prep/preptlist.c| 130 +++--
 src/backend/optimizer/util/inherit.c  | 620 +-
 src/backend/optimizer/util/plancat.c  |  18 +-
 src/backend/optimizer/util/relnode.c  | 359 +++--
 src/backend/partitioning/partprune.c  |  44 +-
 src/include/nodes/pathnodes.h |   7 +
 src/include/optimizer/inherit.h   |   5 +-
 src/include/optimizer/pathnode.h  |   3 +
 src/include/optimizer/plancat.h   |   4 +-
 src/include/optimizer/planmain.h  |   1 +
 src/include/optimizer/prep.h  |   2 +
 src/test/regress/expected/partition_aggregate.out |   4 +-
 19 files changed, 1185 insertions(+), 675 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out 
b/contrib/postgres_fdw/expected/postgres_fdw.out
index

Re: speeding up planning with partitions

2019-03-01 Thread Amit Langote
On 2019/03/01 22:01, Amit Langote wrote:
> Of course, I had to make sure that query_planner (which is not
> in the charge of adding source inheritance child objects) can notice that.

Oops, I meant to write "query_planner (which *is* in the charge of adding
source inheritance child objects)..."

Thanks,
Amit




Re: Add exclusive backup deprecation notes to documentation

2019-03-01 Thread David Steele

On 3/1/19 1:13 PM, Peter Eisentraut wrote:

Please follow the 1-space indentation in the documentation files.


Whoops.  Will fix.


I think the style of mentioning all the problems in a note before the
actual description is a bit backwards.


In the case of an important note like this I think it should be right at 
the top where people will see it.  Not everyone reads to the end.


--
-David
da...@pgmasters.net



Re: Add exclusive backup deprecation notes to documentation

2019-03-01 Thread Magnus Hagander
On Fri, Mar 1, 2019 at 2:07 PM David Steele  wrote:

> On 3/1/19 1:13 PM, Peter Eisentraut wrote:
> > Please follow the 1-space indentation in the documentation files.
>
> Whoops.  Will fix.
>
> > I think the style of mentioning all the problems in a note before the
> > actual description is a bit backwards.
>
> In the case of an important note like this I think it should be right at
> the top where people will see it.  Not everyone reads to the end
>

Maybe have the first note say "This method is deprecated because it has
serious risks (see bellow)" and then list the actual risks at the end?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: pgsql: Build src/port files as a library with -fPIC, and use that in li

2019-03-01 Thread Christoph Berg
Re: Tom Lane 2019-01-31 <12792.1548965...@sss.pgh.pa.us>
> initdb hasn't depended on those libpq exports since 8.2, and it was
> a bug that it did so even then.

9.2 psql (and createuser, ...) is also broken:

/usr/lib/postgresql/9.2/bin/psql: symbol lookup error: 
/usr/lib/postgresql/9.2/bin/psql: undefined symbol: pqsignal

Could we at least put a compat symbol back?

Christoph



Re: Add exclusive backup deprecation notes to documentation

2019-03-01 Thread Laurenz Albe
Magnus Hagander wrote:
> Maybe have the first note say "This method is deprecated because it has 
> serious
> risks (see bellow)" and then list the actual risks at the end? 

Good idea.  That may attract the attention of the dogs among the readers.

Yours,
Laurenz Albe




Re: PostgreSQL vs SQL/XML Standards

2019-03-01 Thread Chapman Flack
On 03/01/19 07:15, Ramanarayana wrote:
> Hi,
> I have tested bug fixes provided by all the patches. They are working
> great. I found one minor issue
> 
>  select * from xmltable('*' PASSING 'pre arg?>&deeppost' COLUMNS x XML PATH '/');
> 
> The above query returns the xml. But there is an extra plus symbol at the
> end
> 
> pre&deeppost+

Hi,

Are you sure that isn't the + added by psql when displaying a value
that contains a newline?

What happens if you repeat the query but after the psql command

  \a

to leave the output unaligned?

Thanks!
-Chap



Re: PostgreSQL vs SQL/XML Standards

2019-03-01 Thread Ramanarayana
Hi,
Yes it is working fine with \a option in psql.

Cheers
Ram 4.0


Re: pg_dump/pg_restore fail for TAR_DUMP and CUSTOM_DUMP from v94/v95/v96 to v11/master.

2019-03-01 Thread Tom Lane
Suraj Kharage  writes:
> The Commit 5955d934194c3888f30318209ade71b53d29777f has changed the logic
> to avoid dumping creation and comment commands for the public schema.

Yup.

> As reported by Prabhat, if we try to restore the custom/tar dump taken from
> v10 and earlier versions, we get the reported error for public schema.

Yes.  We're not intending to do anything about that.  The previous scheme
also caused pointless errors in some situations, so this isn't really a
regression.  The area is messy enough already that trying to avoid errors
even with old (wrong) archives would almost certainly cause more problems
than it solves.  In particular, it's *not* easy to fix things in a way
that works conveniently for both superuser and non-superuser restores.
See the mail thread referenced by 5955d9341.

(Note that it's only been very recently that anyone had any expectation
that pg_dump scripts could be restored with zero errors in all cases;
the usual advice was just to ignore noncritical errors.  I'm not that
excited about it if the old advice is still needed when dealing with old
archives.)

regards, tom lane



Re: partitioned tables referenced by FKs

2019-03-01 Thread Jesper Pedersen

Hi Alvaro,

On 2/28/19 1:28 PM, Alvaro Herrera wrote:

Rebased to current master.  I'll reply later to handle the other issues
you point out.



Applying with hunks.

With 0003 using

export CFLAGS="-DCOPY_PARSE_PLAN_TREES -O0 -fno-omit-frame-pointer" && 
CC="ccache gcc" ./configure --prefix /usr/local/packages/postgresql-12.x 
--enable-dtrace --with-openssl --with-gssapi --with-libxml --with-llvm 
--enable-debug --enable-depend --enable-tap-tests --enable-cassert


I'm getting a failure in the pg_upgrade test:

 --
+-- Name: pk5 pk5_pkey; Type: CONSTRAINT; Schema: regress_fk; Owner: 
jpedersen

+--
+
+ALTER TABLE ONLY regress_fk.pk5
+ADD CONSTRAINT pk5_pkey PRIMARY KEY (a);
+
+
+--

when running check-world.

Best regards,
 Jesper



Re: propagating replica identity to partitions

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 6:13 PM Alvaro Herrera  wrote:
> Tablespaces already behave a little bit especially in another sense:
> it doesn't recurse to indexes.  I think it's not possible to implement a
> three-way flag, where you tell it to move only the table, or to recurse
> to child tables but leave local indexes alone, or just to move
> everything.  If we don't move the child indexes, are we really recursing
> in that sense?

I don't quite follow this.  If you want to change the default for new
partitions, you would use ONLY, which updates the value for the parent
(which is cheap) but doesn't touch the children.  If you want to move
it all, you would omit ONLY.  I'm not sure I quite understand the third
case - what do you mean by moving the child tables but leaving the
local indexes alone?

> I think changing the behavior of tablespaces is likely to cause pain for
> people with existing scripts that expect the command not to recurse.  We
> would have to add a switch of some kind to provide the old behavior in
> order not to break those, and that doesn't seem so good either.
>
> I agree we definitely want to treat a partitioned table as similarly as
> possible to a non-partitioned table, but in the particular case of
> tablespace it seems to me better not to mess with the behavior.

You may be right, but I'm not 100% convinced.  I don't understand why
changing the behavior for tablespaces would be a giant shocker and
changing the behavior for OWNER TO would be a big nothing.  If you
mess up the first one, you'll realize it's running for too long and go
"oh, oops" and fix it next time.  If you mess up the second one,
you'll have a security hole, be exploited by hackers, suffer civil and
criminal investigations due to a massive data security breach, and end
your life in penury and in prison, friendless and alone.  Or maybe
not, but the point is that BOTH of these things have an established
behavior such that changing it may surprise some people, and actually
I would argue that the surprise for the TABLESPACE will tend to be
more obvious and less likely to have subtle, unintended consequences.

I hate to open a can of worms here, but there's also the relationship
between OWNER TO and GRANT/REVOKE; that might also need a little
thought.

> CLUSTER: I'm not sure about this one.  For legacy inheritance there's
> no principled way to figure out the right index to recurse to.  For
> partitioning that seems a very simple problem, but I do wonder if
> there's any use case at all for ALTER TABLE CLUSTER there.  My
> inclination would be to reject ALTER TABLE CLUSTER on a partitioned
> table altogether, if it isn't already.

Hmm, I disagree.  I think this should work and set the value for the
whole hierarchy.  That seems well-defined and useful -- why is it any
less useful than for a standalone table?

> OWNER: let's just change this one to recurse always.  I think we have a
> consensus about this one.

We really don't have many votes either way, AFAICS.  I'm not direly
opposed, but I'd like to see a more vigorous attempt to make them ALL
recurse rather than changing one per release for the next 8 years.

> TRIGGER: I think it would be good to recurse, based on the trigger name.
> I'm not sure what to do about legacy inheritance; the trigger isn't
> likely to be there.  An option is to silently ignore each inheritance
> child (not partition) if the trigger isn't present on it.

Hmm, I think this one should maybe recurse to triggers that cascaded
downwards, which would of its nature apply to partitioning but not to
inheritance.

> We all seem to agree that REPLICA IDENTITY should recurse.

My feeling on this is the same as for OWNER.

> Maybe ADD GENERATED AS IDENTITY / DROP IDENTITY should recurse; not
> really sure about this one.  Peter?

I don't know enough about this to have an opinion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Question about commit 11cf92f6e2e13c0a6e3f98be3e629e6bd90b74d5

2019-03-01 Thread Robert Haas
On Fri, Mar 1, 2019 at 5:47 AM Etsuro Fujita
 wrote:
> Robert, I CCed you because you are the author of that commit.  Before
> that commit ("Rewrite the code that applies scan/join targets to
> paths."), apply_scanjoin_target_to_paths() had a boolean parameter named
> modify_in_place, and used apply_projection_to_path(), not
> create_projection_path(), to adjust scan/join paths when modify_in_place
> was true, which allowed us to save cycles at plan creation time by
> avoiding creating projection paths, which I think would be a good thing,
> but that commit removed that.  Why?

One of the goals of the commit was to properly account for the cost of
computing the target list.  Before parallel query and partition-wise
join, it didn't really matter what the cost of computing the target
list was, because every path was going to have to do the same work, so
it was just a constant factor getting added to every path.  However,
parallel query and partition-wise join mean that some paths can
compute the final target list more cheaply than others, and that turns
out to be important for things like PostGIS.  One of the complaints
that provoked that change was that PostGIS was picking non-parallel
plans even when a parallel plan was substantially superior, because it
wasn't accounting for the fact that in the parallel plan, the cost of
computing the target-list could be shared across all the workers
rather than paid entirely by the leader.

In order to accomplish this goal of properly accounting for the cost
of computing the target list, we need to create a new path, not just
jam the target list into an already-costed path.  Note that we did
some performance optimization around the same time to minimize the
performance hit here (see d7c19e62a8e0a634eb6b29f8fd944e57081f,
and I think there may have been something else as well although I
can't find it right now).

> The real reason for this question is: I noticed that projection paths
> put on foreign paths will make it hard for FDWs to detect whether there
> is an already-well-enough-sorted remote path in the pathlist for the
> final scan/join relation as an input relation to GetForeignUpperPaths()
> for the UPPERREL_ORDERED step (the IsA(path, ForeignPath) test would not
> work well enough to detect remote paths!), so I'm wondering whether we
> could revive that parameter like the attached, to avoid the overhead at
> plan creation time and to make the FDW work easy.  Maybe I'm missing
> something, though.

I think this would be a bad idea, for the reasons explained above.  I
also think that it's probably the wrong direction on principle.  I
think the way we account for target lists is still pretty crude and
needs to be thought of as more of a real planning step and less as
something that we can just ignore when it's inconvenient for some
reason.  I think the FDW just needs to look through the projection
path and see what's underneath, but also take the projection path's
target list into account when deciding whether more can be pushed down.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Problems with plan estimates in postgres_fdw

2019-03-01 Thread Antonin Houska
Etsuro Fujita  wrote:

> (2019/03/01 20:00), Antonin Houska wrote:
> > Etsuro Fujita  wrote:

> > I used gdb to help me understand, however the condition
> >
> > if (fpextra&&  !IS_UPPER_REL(foreignrel))
> >
> > never evaluated to true with the query above.
> 
> Sorry, my explanation was not enough again, but I showed that query ("SELECT
> a+b, random() FROM foreign_table GROUP BY a+b ORDER BY a+b;") to explain why
> the following code bit is needed:
> 
> +   /*
> +    * If this includes an UPPERREL_ORDERED step, the given target, which
> +    * would be the final target to be applied to the resulting path, might
> +    * have different expressions from the underlying relation's reltarget
> +    * (see make_sort_input_target()); adjust tlist eval costs.
> +    */
> +   if (fpextra && fpextra->target != foreignrel->reltarget)
> +   {
> +       QualCost    oldcost = foreignrel->reltarget->cost;
> +       QualCost    newcost = fpextra->target->cost;
> +
> +       startup_cost += newcost.startup - oldcost.startup;
> +       total_cost += newcost.startup - oldcost.startup;
> +       total_cost += (newcost.per_tuple - oldcost.per_tuple) * rows;
> +   }

Maybe I understand now. Do the expressions (newcost.* - oldcost.*) reflect the
fact that, for the query

SELECT a+b, random() FROM foreign_table GROUP BY a+b ORDER BY a+b;

the UPPERREL_ORDERED stage only needs to evaluate the random() function
because (a + b) was already evaluated during the UPPERREL_GROUP_AGG stage?
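
For concreteness, here is a toy calculation of what that adjustment does,
with made-up cost numbers (not taken from the planner):

#include <stdio.h>

typedef struct QualCost
{
	double		startup;
	double		per_tuple;
} QualCost;

int
main(void)
{
	/* made-up numbers: reltarget {a+b} vs. final target {a+b, random()} */
	QualCost	oldcost = {0.0, 0.0050};
	QualCost	newcost = {0.0, 0.0075};
	double		rows = 1000.0;
	double		startup_cost = 10.0;
	double		total_cost = 50.0;

	startup_cost += newcost.startup - oldcost.startup;
	total_cost += newcost.startup - oldcost.startup;
	total_cost += (newcost.per_tuple - oldcost.per_tuple) * rows;

	/* only the 0.0025/tuple difference -- i.e. random() -- gets charged */
	printf("startup=%.2f total=%.2f\n", startup_cost, total_cost);
	return 0;
}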

-- 
Antonin Houska
https://www.cybertec-postgresql.com



Re: Looks heap_create_with_catalog ignored the if_not_exists options

2019-03-01 Thread Andy Fan
Thank you, Michael!

What can I do if I'm sure I will not use the CTAS creation path?  Looking
at the "heap_create_with_catalog" function, it checks for an existing
relation and raises an error.  Even if I change it to "check it &&
if_not_existing" before raising the error, it is still problematic, since
some other session may create the relation between the check and the end
of the actual creation.

It looks like we need some locking there, but since PG uses a process
model, I don't even know how to synchronize code among processes in PG
(any hint on this would be very welcome as well).

On Fri, Mar 1, 2019 at 8:35 PM Michael Paquier  wrote:

> On Fri, Mar 01, 2019 at 07:17:04PM +0800, Andy Fan wrote:
> > for a createStmt,  it will call transformCreateStmt,  and then
> > heap_create_with_catalog.
> > but looks it just check the if_not_exists in transformCreateStmt.
> >
> > is it designed as this on purpose or is it a bug?
>
> That's a bug.  Andreas Karlsson and I have been discussing it a couple
> of days ago actually:
> https://www.postgresql.org/message-id/20190215081451.gd2...@paquier.xyz
>
> Fixing this is not as straight-forward as it seems, as it requires
> shuffling a bit the code related to a CTAS creation so as all code
> paths check at the same time for an existing relation.  Based on my
> first impressions, I got the feeling that it would be rather invasive
> and not worth a back-patch.
> --
> Michael
>


Re: SQL/JSON: JSON_TABLE

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 8:19 PM Nikita Glukhov  wrote:
> Attached 34th version of the patches.

Kinda strange version numbering -- the last post on this thread is v21.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: FETCH FIRST clause PERCENT option

2019-03-01 Thread Tomas Vondra


On 3/1/19 2:31 AM, Kyotaro HORIGUCHI wrote:
> Hello.
> 
> At Thu, 28 Feb 2019 21:16:25 +0100, Tomas Vondra 
>  wrote in 
> 
>>> One biggest issue seems to be we don't know the total number of
> 
> # One *of* the biggest *issues*?
> 
>>> outer tuples before actually reading a null tuple. I doubt of
>>> general shortcut for that. It also seems preventing limit node
>>> from just using materialized outer.
>>>
>>
>> Sure, if you actually want all tuples, you'll have to execute the outer
>> plan till completion. But that's not what I'm talking about - what if we
>> only ever need to read one row from the limit?
> 
> We have no choice than once reading all tuples just to find we
> are to return just one row, since estimator is not guaranteed to
> be exact as required for this purpose.
> 

I don't follow - I'm not suggesting to base this on row estimates at
all. I'm saying that we can read in rows, and decide how many rows we
are expected to produce.

For example, with 1% sample, we'll produce about 1 row for each 100 rows
read from the outer plan. So we read the first row, and compute

  rows_to_produce = ceil(0.01 * rows_read)

and every time this increases, we emit another row from a tuplestore.
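
A rough standalone sketch of that incremental bookkeeping (a simple C
simulation with made-up numbers, not the actual executor code):

#include <math.h>
#include <stdio.h>

int
main(void)
{
	const double percent = 1.0;	/* FETCH FIRST 1 PERCENT */
	int			rows_emitted = 0;

	/* pretend the outer plan returns 250 rows */
	for (int rows_read = 1; rows_read <= 250; rows_read++)
	{
		/* the real thing would stash the row in a tuplestore here */
		int			quota = (int) ceil(percent / 100.0 * rows_read);

		/* each time the quota grows, one more stored row can be emitted */
		while (rows_emitted < quota)
			printf("emit row %d after reading %d rows\n",
				   ++rows_emitted, rows_read);
	}
	return 0;
}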


>> To give you a (admittedly, somewhat contrived and artificial example):
>>
>> SELECT * FROM t1 WHERE id IN (
>>   SELECT id FROM t2 ORDER BY x FETCH FIRST 10 PERCENT ROWS ONLY
>> );
>>
>> Maybe this example is bogus and/or does not really matter in practice. I
>> don't know, but I've been unable to convince myself that's the case.
> 
> I see such kind of idiom common. Even in the quite simple example
> above, *we* cannot tell how many tuples the inner should return
> unless we actually fetch all tuples in t2. This is the same
> problem with count(*).
> 
> The query is equivalent to the following one.
> 
>  SELECT * FROM t1 WHERE id IN (
>SELECT id FROM t2 ORDER BY x
>  FETCH FIRST (SELECT ceil(count(*) * 0.1) FROM t2) ROWS ONLY
>  );
> 
> This scans t2 twice, but this patch does only one full scan
> moving another partial scan to tuplestore. We would win if the
> outer is complex enough.
> 

But what if all t1 rows match the very first row in the subselect? In
that case we don't need to read all rows from the subplan - we only need
to read the first chunk (until we emit the first row).

> Anyway, even excluding the count(*) issue, it seems that we are
> not alone in that point. (That is, at least Oracle shows
> different E-Rows and A-Rows for PERCENT).
> 

I'm not sure what E-Rows and A-Rows are? I suppose that's estimated and
actual - but as I explained above, that's not what I propose the
incremental approach to use.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: NOT IN subquery optimization

2019-03-01 Thread Andres Freund
Hi,

On March 1, 2019 4:53:03 AM PST, David Rowley  
wrote:
>On Fri, 1 Mar 2019 at 15:27, Richard Guo  wrote:
>> I have reviewed your patch. Good job except two issues I can find:
>>
>> 1. The patch would give wrong results when the inner side is empty.
>In this
>> case, the whole data from outer side should be in the outputs. But
>with the
>> patch, we will lose the NULLs from outer side.
>>
>> 2. Because of the new added predicate 'OR (var is NULL)', we cannot
>use hash
>> join or merge join to do the ANTI JOIN.  Nested loop becomes the only
>choice,
>> which is low-efficency.
>
>Yeah. Both of these seem pretty fundamental, so setting the patch to
>waiting on author.

I've not checked, but could we please make sure these cases are covered in the 
regression tests today with a single liner? Seems people had to rediscover them 
a number of times now, and unless this thread results in an integrated feature 
soonish, it seems likely other people will again.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Prevent extension creation in temporary schemas

2019-03-01 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Feb 28, 2019 at 10:52:52PM -0500, Tom Lane wrote:
>> If you're suggesting that we disable that security restriction
>> during extension creation, I really can't see how that'd be a
>> good thing ...

> No, I don't mean that.  I was just wondering if someone can set
> search_path within the SQL script which includes the extension
> contents to bypass the restriction and the error.  They can always
> prefix such objects with pg_temp anyway if need be...

You'd have to look in namespace.c to be sure, but I *think* that
we don't consult the temp schema during function/operator lookup
even if it's explicitly listed in search_path.

It might be possible for an extension script to get around this with
code like, say,

CREATE TRIGGER ... EXECUTE PROCEDURE @extschema@.myfunc();

although you'd have to give up relocatability of the extension
to use @extschema@.  (Maybe it was a bad idea to not provide
that symbol in relocatable extensions?  A usage like this doesn't
prevent the extension from being relocated later.)

regards, tom lane



Re: pg_partition_tree crashes for a non-defined relation

2019-03-01 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Feb 28, 2019 at 11:50:16PM -0500, Tom Lane wrote:
>> But, having said that, we've learned that it's generally bad for
>> catalog-query functions to fail outright just because they're pointed
>> at the wrong kind of catalog object.  So I think that what we want here
>> is for pg_partition_tree to return NULL or an empty set or some such
>> for a plain table, while its output for a childless partitioned table
>> should be visibly different from that.  I'm not wedded to details
>> beyond that idea.

> Yep, that's the intention since cc53123.  I won't come back to return
> an ERROR in any case.  Here is what the patch gives for childless
> partitions FWIW:
> =# CREATE TABLE ptif_test (a int, b int) PARTITION BY range (a);
> CREATE TABLE
> =# SELECT * FROM pg_partition_tree('ptif_test');
>relid   | parentrelid | isleaf | level
> ---+-++---
>  ptif_test | null| f  | 0
> (1 row)

Right, while you'd get zero rows out for a non-partitioned table.
WFM.

regards, tom lane



Re: NOT IN subquery optimization

2019-03-01 Thread Tom Lane
Andres Freund  writes:
> On March 1, 2019 4:53:03 AM PST, David Rowley  
> wrote:
>> On Fri, 1 Mar 2019 at 15:27, Richard Guo  wrote:
>>> 1. The patch would give wrong results when the inner side is empty.
>>> 2. Because of the new added predicate 'OR (var is NULL)', we cannot
>>> use hash join or merge join to do the ANTI JOIN.

> I've not checked, but could we please make sure these cases are covered
> in the regression tests today with a single liner?

I'm not sure if the second one is actually a semantics bug or just a
misoptimization?  But yeah, +1 for putting in some simple tests for
corner cases right now.  Anyone want to propose a specific patch?

regards, tom lane



Re: SQL/JSON: JSON_TABLE

2019-03-01 Thread Nikita Glukhov

On 01.03.2019 19:17, Robert Haas wrote:


On Thu, Feb 28, 2019 at 8:19 PM Nikita Glukhov  wrote:

Attached 34th version of the patches.

Kinda strange version numbering -- the last post on this thread is v21.


For simplicity of dependency tracking, the version numbering of the JSON_TABLE
patches matches the version numbering of the patches on which they depend --
jsonpath and SQL/JSON.  The corresponding jsonpath patch has version v34 now.

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: readdir is incorrectly implemented at Windows

2019-03-01 Thread Andrew Dunstan


On 2/25/19 10:38 AM, Konstantin Knizhnik wrote:
> Hi hackers,
>
> Small issue with readdir implementation for Windows.
> Right now it returns ENOENT in case of any error returned by
> FindFirstFile.
> So all places in Postgres where opendir/listdir are used will assume
> that directory is empty and
> do nothing without reporting any error.
> It is not so good if directory is actually not empty but there are not
> enough permissions for accessing the directory and FindFirstFile
> returns ERROR_ACCESS_DENIED:
>
> struct dirent *
> readdir(DIR *d)
> {
>     WIN32_FIND_DATA fd;
>
>     if (d->handle == INVALID_HANDLE_VALUE)
>     {
>         d->handle = FindFirstFile(d->dirname, &fd);
>         if (d->handle == INVALID_HANDLE_VALUE)
>         {
>             errno = ENOENT;
>             return NULL;
>         }
>     }
>
>
> Attached please find small patch fixing the problem.
>

Diagnosis seems correct. I wonder if this is responsible for some odd
things we've seen over time on Windows.

This reads a bit oddly:


     {
-            errno = ENOENT;
+            if (GetLastError() == ERROR_FILE_NOT_FOUND)
+            {
+                /* No more files, force errno=0 (unlike mingw) */
+                errno = 0;
+                return NULL;
+            }
+            _dosmaperr(GetLastError());
         return NULL;
     }


Why not something like:


if (GetLastError() == ERROR_FILE_NOT_FOUND)
    errno = 0;
else
    _dosmaperr(GetLastError());
return NULL;


cheers


andrew


-- 
Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Refactoring the checkpointer's fsync request queue

2019-03-01 Thread Andres Freund
Hi,

On 2019-03-01 23:17:27 +1300, Thomas Munro wrote:
> @@ -8616,7 +8617,7 @@ CreateCheckPoint(int flags)
>  * the REDO pointer.  Note that smgr must not do anything that'd have 
> to
>  * be undone if we decide no checkpoint is needed.
>  */
> -   smgrpreckpt();
> +   PreCheckpoint();
> 
> I would call this and the "post" variant something like
> SyncPreCheckpoint().  Otherwise it's too general sounding and not
> clear which module it's in.

Definitely.

> +typedef enum syncrequesttype
> +{
> +   SYNC_REQUEST,
> +   FORGET_REQUEST,
> +   FORGET_HIERARCHY_REQUEST,
> +   UNLINK_REQUEST
> +} syncrequesttype;
> 
> These names are too generic for the global C namespace; how about
> SYNC_REQ_CANCEL or similar?
> 
> +typedef enum syncrequestowner
> +{
> +   SYNC_MD = 0 /* md smgr */
> +} syncrequestowner;
> 
> I have a feeling that Andres wanted to see a single enum combining
> both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD,
> SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you
> have it.

Obviously it's nicer looking this way, but OTOH, that means we have to
send more data over the queue, because we can't easily combine the
request + "owner". I don't have too strong feelings about it though.

FWIW, I don't like the name owner here. Class? Method?

Greetings,

Andres Freund



Re: Minimal logical decoding on standbys

2019-03-01 Thread Andres Freund
Hi,

On 2019-03-01 13:33:23 +0530, tushar wrote:
> While testing  this feature  found that - if lots of insert happened on the
> master cluster then pg_recvlogical is not showing the DATA information  on
> logical replication slot which created on SLAVE.
> 
> Please refer this scenario -
> 
> 1)
> Create a Master cluster with wal_level=logcal and create logical replication
> slot -
>  SELECT * FROM pg_create_logical_replication_slot('master_slot',
> 'test_decoding');
> 
> 2)
> Create a Standby  cluster using pg_basebackup ( ./pg_basebackup -D slave/ -v
> -R)  and create logical replication slot -
> SELECT * FROM pg_create_logical_replication_slot('standby_slot',
> 'test_decoding');

So, if I understand correctly you do *not* have a physical replication
slot for this standby? For the feature to work reliably that needs to
exist, and you need to have hot_standby_feedback enabled. Does having
that fix the issue?

Thanks,

Andres



Infinity vs Error for division by zero

2019-03-01 Thread Matt Pulver
Hello,

PostgreSQL FLOAT appears to support +/-Infinity and NaN per the IEEE 754
standard, with expressions such as CAST('NaN' AS FLOAT) and CAST('Infinity'
AS FLOAT) and even supports ordering columns of floats that contain NaN.

However the query "SELECT 1.0/0.0;" produces an exception:

ERROR:  division by zero


Question: If Infinity and NaN are supported, then why throw an exception
here, instead of returning Infinity? Is it purely for historical reasons,
or if it could all be done again, would an exception still be preferred?

For purely integer arithmetic, I can see how an exception would make sense.
However for FLOAT, I would expect/prefer Infinity to be returned.

Best regards,
Matt


Re: Infinity vs Error for division by zero

2019-03-01 Thread Andrew Gierth
> "Matt" == Matt Pulver  writes:

 Matt> ERROR:  division by zero

 Matt> Question: If Infinity and NaN are supported, then why throw an
 Matt> exception here, instead of returning Infinity?

Spec says so:

  4) The dyadic arithmetic operators , ,
 , and  (+, -, *, and /, respectively) specify
 addition, subtraction, multiplication, and division, respectively.
 If the value of a divisor is zero, then an exception condition is
 raised: data exception -- division by zero.

-- 
Andrew (irc:RhodiumToad)



Re: Infinity vs Error for division by zero

2019-03-01 Thread David G. Johnston
On Friday, March 1, 2019, Matt Pulver  wrote:

> However the query "SELECT 1.0/0.0;" produces an exception:
>
> ERROR:  division by zero
>
>
> Question: If Infinity and NaN are supported, then why throw an exception
> here, instead of returning Infinity? Is it purely for historical reasons,
> or if it could all be done again, would an exception still be preferred?
>
> For purely integer arithmetic, I can see how an exception would make
> sense. However for FLOAT, I would expect/prefer Infinity to be returned.
>

1/0 is an illegal operation.  We could return NaN for it but the choice of
throwing an error is just as correct.  Returning infinity is strictly
incorrect.

Changing the behavior is not going to happen for any existing data types.

David J.


Re: Infinity vs Error for division by zero

2019-03-01 Thread Andres Freund
Hi,

On 2019-03-01 12:46:55 -0500, Matt Pulver wrote:
> PostgreSQL FLOAT appears to support +/-Infinity and NaN per the IEEE 754
> standard, with expressions such as CAST('NaN' AS FLOAT) and CAST('Infinity'
> AS FLOAT) and even supports ordering columns of floats that contain NaN.
> 
> However the query "SELECT 1.0/0.0;" produces an exception:
> 
> ERROR:  division by zero
> 
> 
> Question: If Infinity and NaN are supported, then why throw an exception
> here, instead of returning Infinity? Is it purely for historical reasons,
> or if it could all be done again, would an exception still be preferred?
> 
> For purely integer arithmetic, I can see how an exception would make sense.
> However for FLOAT, I would expect/prefer Infinity to be returned.

It'd be good for performance reasons to not have to check for that, and
for under/overflow. But the historical behaviour has quite some weight,
and there's some language in the standard that can legitimate be
interpreted that both conditions need to be signalled, if I recall
correctly.

Greetings,

Andres Freund



Re: Refactoring the checkpointer's fsync request queue

2019-03-01 Thread Robert Haas
On Fri, Mar 1, 2019 at 12:43 PM Andres Freund  wrote:
> Obviously it's nicer looking this way, but OTOH, that means we have to
> send more data over the queue, because we can't easily combine the
> request + "owner". I don't have too strong feelings about it though.

Yeah, I would lean toward combining those.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Log a sample of transactions

2019-03-01 Thread Adrien NAYRAT

Hello,

On 2/15/19 3:24 PM, Adrien NAYRAT wrote:

On 2/14/19 9:14 PM, Andres Freund wrote:

I wonder if this doesn't need a warning, explaining that using this when
there are large transactions could lead to slowdowns.


Yes, I will add some wording


Warning added.


It seems pretty weird to have this in postgres.c, given you declared it
in xact.h?


Yes, I have to revise my C. I will move it to 
src/backend/access/transam/xact.c


Fixed


Why are both of these checked? ISTM once xact_is_sampled is set, we
ought not to respect a different value of log_xact_sample_rate for the
rest of that transaction.


I added these checks to allow disabling logging during a sampled 
transaction, per the suggestion of Masahiko Sawada:
https://www.postgresql.org/message-id/CAD21AoD4t%2BhTV6XfK5Yz%3DEocB8oMyJSYFfjAryCDYtqfib2GrA%40mail.gmail.com 


I added a comment to explain why transaction logging is rechecked.



As far as I can tell xact_is_sampled is not properly reset in case of
errors?





I am not sure if I should disable logging in case of errors. Actually we 
have:


postgres=# set log_transaction_sample_rate to 1;
SET
postgres=# set client_min_messages to 'LOG';
LOG:  duration: 0.392 ms  statement: set client_min_messages to 'LOG';
SET
postgres=# begin;
LOG:  duration: 0.345 ms  statement: begin;
BEGIN
postgres=# select 1;
LOG:  duration: 0.479 ms  statement: select 1;
 ?column?
--
1
(1 row)

postgres=# select * from missingtable;
ERROR:  relation "missingtable" does not exist
LINE 1: select * from missingtable;
  ^
postgres=# select 1;
ERROR:  current transaction is aborted, commands ignored until end of 
transaction block

postgres=# rollback;
LOG:  duration: 11390.295 ms  statement: rollback;
ROLLBACK

If I reset xact_is_sampled (after the first error inside 
AbortTransaction if I am right), the "rollback" statement will not be 
logged. I wonder if this "statement" should be logged?


If the answer is no, I will reset xact_is_sampled in AbortTransaction.



Patch attached with fix mentioned above, but without xact_is_sampled reset.

Cheers,


diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6d42b7afe7..dbf5b81243 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5819,6 +5819,32 @@ local0.*/var/log/postgresql

   
 
+ 
+  log_transaction_sample_rate (real)
+  
+   log_transaction_sample_rate configuration parameter
+  
+  
+   
+
+ Set the fraction of transactions whose statements are logged. It applies
+ to each new transaction regardless of its duration. The default is
+ 0, meaning that statements are not logged.
+ Setting this to 1 logs all statements for all transactions.
+ Logging can be disabled inside a transaction by setting this to 0.
+ log_transaction_sample_rate is helpful to track a
+ sample of transactions.
+
+   
+
+ This option can add overhead when transactions are large (many short queries),
+ or when the workload is high and log_transaction_sample_rate
+ is close to 1.
+
+   
+   
+ 
+
  
 
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e93262975d..2a301b88ea 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -244,6 +244,9 @@ static char *prepareGID;
  */
 static bool forceSyncCommit = false;
 
+/* Flag for logging statements in a transaction. */
+boolxact_is_sampled = false;
+
 /*
  * Private context for transaction-abort work --- we reserve space for this
  * at startup to ensure that AbortTransaction and AbortSubTransaction can work
@@ -1822,6 +1825,9 @@ StartTransaction(void)
 	s->state = TRANS_START;
 	s->transactionId = InvalidTransactionId;	/* until assigned */
 
+	/* Determine if statements are logged in this transaction */
+	xact_is_sampled = random() <= log_xact_sample_rate * MAX_RANDOM_VALUE;
+
 	/*
 	 * initialize current transaction state fields
 	 *
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8b4d94c9a1..76f7f8a912 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2203,6 +2203,8 @@ check_log_statement(List *stmt_list)
  * check_log_duration
  *		Determine whether current command's duration should be logged.
  *		If log_statement_sample_rate < 1.0, log only a sample.
+ *		We also check if this statement in this transaction must be logged
+ *		(regardless of its duration).
  *
  * Returns:
  *		0 if no logging is needed
@@ -2218,7 +2220,8 @@ check_log_statement(List *stmt_list)
 int
 check_log_duration(char *msec_str, bool was_logged)
 {
-	if (log_duration || log_min_duration_statement >= 0)
+	if (log_duration || log_min_duration_statement >= 0 ||
+		(xact_is_sampled && log_xact_sample_rate > 0))
 	{
 		long		secs;
 		int			usecs;
@@ -2252,11 +2255,17 @@ check_l

Re: Infinity vs Error for division by zero

2019-03-01 Thread Chapman Flack
On 3/1/19 1:04 PM, David G. Johnston wrote:

> 1/0 is an illegal operation.  We could return NaN for it but the choice of
> throwing an error is just as correct.  Returning infinity is strictly
> incorrect.

That differs from my understanding of how the operations are specified
in IEEE 754 (as summarized in, e.g., [1]).

Andrew posted the relevant part of the SQL spec that requires the
operation to raise 22012.

That's a requirement specific to SQL (which is, of course, what matters
here.)

But if someone wanted to write a user-defined division function or
operator that would return Inf for (anything > 0) / 0 and for
(anything < 0) / -0, and -Inf for (anything < 0) / 0 and for
(anything > 0) / -0, and NaN for (either zero) / (either zero), I think
that function or operator would be fully in keeping with IEEE 754.
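
For example, a small C program shows that behavior, assuming the platform's
double is IEEE 754 and floating-point traps are disabled (the usual default):

#include <stdio.h>

int
main(void)
{
	double		pz = 0.0;
	double		nz = -0.0;

	printf("%g\n", 1.0 / pz);	/* inf */
	printf("%g\n", -1.0 / pz);	/* -inf */
	printf("%g\n", 1.0 / nz);	/* -inf */
	printf("%g\n", -1.0 / nz);	/* inf */
	printf("%g\n", pz / pz);	/* nan (sign is implementation-dependent) */
	return 0;
}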

-Chap


[1] https://steve.hollasch.net/cgindex/coding/ieeefloat.html#operations



Re: Infinity vs Error for division by zero

2019-03-01 Thread Andres Freund
On 2019-03-01 11:04:04 -0700, David G. Johnston wrote:
> Changing the behavior is not going to happen for any existing data types.

For the overflow case that really sucks, because we're leaving a very
significant amount of performance on the table because we recheck for
overflow in every op. The actual float operation is basically free, but
the overflow check and the calling convention is not. JIT can get rid of the
latter, but not the former. Which is why we spend like 30% in one of the
TPCH queries doing overflow checks...

I still kinda wonder whether we can make trapping operations work, but
it's not trivial.



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 3:17 AM Tsunakawa, Takayuki
 wrote:
> Uh, thanks.  I've just recognized I didn't know the meaning of "nuisance."  
> I've looked up the meaning in the dictionary.  Nuisance is like a trouble 
> maker...

Yes, and "attractive nuisance" means something that, superficially, it
looks like a good idea, but later, you find out that it creates many
problems.

I want to make one other point about this patch, which is that over on
the thread "New vacuum option to do only freezing" we have a patch
that does a closely-related thing.  Both patches skip one phase of the
overall VACUUM process.  THIS patch wants to skip truncation; THAT
patch wants to skip index cleanup.  Over there, we seem to have
settled on DISABLE_INDEX_CLEANUP -- only available as a VACUUM option
-- and here I think the proposal is currently VACUUM_SHRINK_ENABLED --
only available as a reloption.

Now that seems not very consistent.  One can only be set as a
reloption, the other only as a VACUUM option.  One talks about what is
enabled, the other about what is disabled.  One puts enable/disable at
the start of the name, the other at the end.

My proposal would be that we make both options available as both
reloptions and vacuum options.  Call the VACUUM options INDEX_CLEANUP
and TRUNCATE and the default will be true but the user can specify
false.  For the reloptions, prefix "vacuum_", thus
vacuum_index_cleanup = true/false and vacuum_truncate = true/false.
If that doesn't appeal, I am open to other ideas how to make this
consistent, but I think it should in some way be made consistent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: New vacuum option to do only freezing

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 3:12 AM Masahiko Sawada  wrote:
> Attached the updated version patch.

Regarding the user interface for this patch, please have a look at the
concerns I mention in

https://www.postgresql.org/message-id/ca+tgmozorx_uuv67rjasx_aswkdzwv8kwfkfrwxyldcqfqj...@mail.gmail.com

I provided a suggestion there as to how to resolve the issue but
naturally others may feel differently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Block level parallel vacuum

2019-03-01 Thread Robert Haas
On Fri, Mar 1, 2019 at 12:19 AM Masahiko Sawada  wrote:
> > I wonder if we really want this behavior.  Should a setting that
> > controls the degree of parallelism when scanning the table also affect
> > VACUUM?  I tend to think that we probably don't ever want VACUUM of a
> > table to be parallel by default, but rather something that the user
> > must explicitly request.  Happy to hear other opinions.  If we do want
> > this behavior, I think this should be written differently, something
> > like this: The PARALLEL N option to VACUUM takes precedence over this
> > option.
>
> For example, I can imagine a use case where a batch job does parallel
> vacuum to some tables in a maintenance window. The batch operation
> will need to compute and specify the degree of parallelism every time
> according to for instance the number of indexes, which would be
> troublesome. But if we can set the degree of parallelism for each
> tables it can just to do 'VACUUM (PARALLEL)'.

True, but the setting in question would also affect the behavior of
sequential scans and index scans.  TBH, I'm not sure that the
parallel_workers reloption is really a great design as it is: is
hard-coding the number of workers really what people want?  Do they
really want the same degree of parallelism for sequential scans and
index scans?  Why should they want the same degree of parallelism also
for VACUUM?  Maybe they do, and maybe somebody explain why they do,
but as of now, it's not obvious to me why that should be true.

> Since the parallel vacuum uses memory in the same manner as the single
> process vacuum it's not deteriorated. I'd agree that that patch is
> more smarter and this patch can be built on top of it but I'm
> concerned that there two proposals on that thread and the discussion
> has not been active for 8 months. I wonder if  it would be worth to
> think of improving the memory allocating based on that patch after the
> parallel vacuum get committed.

Well, I think we can't just say "oh, this patch is going to use twice
as much memory as before," which is what it looks like it's doing
right now. If you think it's not doing that, can you explain further?

> Agreed. I'll separate patches and propose it.

Cool.  Probably best to keep that on this thread.

> The main motivations are refactoring and improving readability but
> it's mainly for the previous version patch which implements parallel
> heap vacuum. It might no longer need here. I'll try to implement
> without LVState. Thank you.

Oh, OK.

> > + /*
> > + * If there is already-updated result in the shared memory we use it.
> > + * Otherwise we pass NULL to index AMs, meaning it's first time call,
> > + * and copy the result to the shared memory segment.
> > + */
> >
> > I'm probably missing something here, but isn't the intention that we
> > only do each index once?  If so, how would there be anything there
> > already?  Once from for_cleanup = false and once for for_cleanup =
> > true?
>
> We call ambulkdelete (for_cleanup = false) 0 or more times for each
> index and call amvacuumcleanup (for_cleanup = true) at the end. In the
> first time calling either ambulkdelete or amvacuumcleanup the lazy
> vacuum must pass NULL to them. They return either palloc'd
> IndexBulkDeleteResult or NULL. If they returns the former the lazy
> vacuum must pass it to them again at the next time. In current design,
> since there is no guarantee that an index is always processed by the
> same vacuum process each vacuum processes save the result to DSM in
> order to share those results among vacuum processes. The 'updated'
> flags indicates that its slot is used. So we can pass the address of
> DSM if 'updated' is true, otherwise pass NULL.

Ah, OK.  Thanks for explaining.

> Almost all of the comments I got have been incorporated into my local
> branch, but a few comments need discussion. I'll submit the updated version
> of the patch once I've addressed all of them.

Cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] EvalPlanQual behaves oddly for FDW queries involving system columns

2019-03-01 Thread Andres Freund
Hi,

On 2019-02-28 10:18:36 -0800, Andres Freund wrote:
> On 2019-02-28 18:28:37 +0900, Etsuro Fujita wrote:
> > > I'm currently
> > > converting the EPQ machinery to slots, and in course of that I (with
> > > Horiguchi-san's help), converted RefetchForeignRow to return a slot. But
> > > there's currently no in-core user of this facility...  I guess I can
> > > rebase the preliminary postgres_fdw patch here, but it bitrotted
> > > significantly.
> > 
> > I'll rebase that patch and help the testing, if you want me to.
> 
> That'd be awesome.

FWIW, I pushed the EPQ patch, doing this conversion blindly. It'd be
awesome if you'd check that it actually works...

Thanks,

Andres



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Andrew Dunstan


On 3/1/19 1:43 PM, Robert Haas wrote:
> On Thu, Feb 28, 2019 at 3:17 AM Tsunakawa, Takayuki
>  wrote:
>> Uh, thanks.  I've just recognized I didn't know the meaning of "nuisance."  
>> I've looked up the meaning in the dictionary.  Nuisance is like a trouble 
>> maker...
> My proposal would be that we make both options available as both
> reloptions and vacuum options.  


+many


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Andres Freund
Hi,

On 2019-02-27 10:55:49 -0500, Robert Haas wrote:
> I don't think that a VACUUM option would be out of place, but a GUC
> sounds like an attractive nuisance to me.  It will encourage people to
> just flip it blindly instead of considering the particular cases where
> they need that behavior, and I think chances are good that most people
> who do that will end up being sad.

OTOH, as the main reason for wanting to disable truncation is that a
user is getting very undesirable HS conflicts, it doesn't seem right to
force them to change the reloption on all tables, and then somehow force
it to be set on all tables created at a later stage. I'm not sure how
that'd be better?

Greetings,

Andres Freund



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Tom Lane
Robert Haas  writes:
> I want to make one other point about this patch, which is that over on
> the thread "New vacuum option to do only freezing" we have a patch
> that does a closely-related thing.  Both patches skip one phase of the
> overall VACUUM process.  THIS patch wants to skip truncation; THAT
> patch wants to skip index cleanup.  Over there, we seem to have
> settled on DISABLE_INDEX_CLEANUP -- only available as a VACUUM option
> -- and here I think the proposal is currently VACUUM_SHRINK_ENABLED --
> only available as a reloption.

> Now that seems not very consistent.

Indeed, but I'm not sure that the use-cases are the same.  In particular,
unless somebody has done some rather impossible magic, it would be
disastrous to apply DISABLE_INDEX_CLEANUP as a reloption, because then
it would be persistent and you'd never get a real vacuum operation and
soon your disk would be full.  Permanently applying truncation disabling
seems less insane.

The gratuitously inconsistent spellings should be harmonized, for sure.

regards, tom lane



Re: [HACKERS] CLUSTER command progress monitor

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 11:54 PM Tatsuro Yamada
 wrote:
> Attached patch is wip patch.

+   CLUSTER and VACUUM FULL,
showing current progress.

and -> or

+   certain commands during command execution.  Currently, the suppoted
+   progress reporting commands are VACUUM and
CLUSTER.

suppoted -> supported

But I'd just say: Currently, the only commands which support progress
reporting are VACUUM and
CLUSTER.

+   Running VACUUM FULL is listed in
pg_stat_progress_cluster
+   view because it uses CLUSTER command
internally.  See .

How about: Running VACUUM FULL is listed in
pg_stat_progress_cluster because both
VACUUM FULL and CLUSTER rewrite
the table, while regular VACUUM only modifies it in
place.

+   Current processing command: CLUSTER/VACUUM FULL.

The command that is running.  Either CLUSTER or VACUUM FULL.

+   Current processing phase of cluster/vacuum full.  See  or .

Current processing phase of CLUSTER or VACUUM FULL.

Or maybe better, just abbreviate to: Current processing phase.

+   Scan method of table: index scan/seq scan.

Eh, shouldn't this be gone now?  And likewise for the view definition?

+   OID of the index.

If the table is being scanned using an index, this is the OID of the
index being used; otherwise, it is zero.

+ heap_tuples_total

Leftovers.  Skipping over the rest of your documentation changes since
it looks like a bunch of things there still need to be updated.

+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);

This now appears inside cluster_rel(), but also vacuum_rel() is still
doing the same thing.  That's wrong.

+ if(OidIsValid(indexOid))

Missing space.  Please pgindent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Tom Lane
Andres Freund  writes:
> OTOH, as the main reason for wanting to disable truncation is that a
> user is getting very undesirable HS conflicts, it doesn't seem right to
> force them to change the reloption on all tables, and then somehow force
> it to be set on all tables created at a later stage. I'm not sure how
> that'd be better?

I think we should reject the whole patch, tbh, and go do something
about the underlying problem instead.  Once we've made truncation
not require AEL, this will be nothing but a legacy wart that we'll
have a hard time getting rid of.

regards, tom lane



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Andres Freund
Hi,

On 2019-03-01 14:17:33 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > OTOH, as the main reason for wanting to disable truncation is that a
> > user is getting very undesirable HS conflicts, it doesn't seem right to
> > force them to change the reloption on all tables, and then somehow force
> > it to be set on all tables created at a later stage. I'm not sure how
> > that'd be better?
> 
> I think we should reject the whole patch, tbh, and go do something
> about the underlying problem instead.  Once we've made truncation
> not require AEL, this will be nothing but a legacy wart that we'll
> have a hard time getting rid of.

IDK, it's really painful in the field, and I'm not quite seeing us
getting rid of the AEL for v12. I think it's a wart, but one that works
around a pretty important usability issue. And I think we should just
remove the GUC without any sort of deprecation afterwards, if necessary
we can add a note to the docs to that effect.  It's not like preventing
truncation from happening is a very intrusive / dangerous thing to do.

Greetings,

Andres Freund



Re: Infinity vs Error for division by zero

2019-03-01 Thread David G. Johnston
On Friday, March 1, 2019, Chapman Flack  wrote:

>
> But if someone wanted to write a user-defined division function or
> operator that would return Inf for (anything > 0) / 0 and for
> (anything < 0) / -0, and -Inf for (anything < 0) / 0 and for
> (anything > 0) / -0, and NaN for (either zero) / (either zero), I think
> that function or operator would be fully in keeping with IEEE 754.
>

Upon further reading you are correct - IEEE 754 has chosen to treat n/0
differently for the n=0 and n<>0 cases.  I'm sure they have their reasons, but
within the scope of this database, and the core arithmetic functions it
provides, those distinctions don't seem meaningful, and having to add
query logic to deal with both cases would just be annoying.  I don't have,
or want to spend on the distraction, the time to understand why such a
decision was made and how it could be useful.  Going from an exception to NaN
makes sense to me; going instead to infinity - outside of limit expressions,
which aren't applicable here - does not.

For my part, in the queries I have that encounter divide-by-zero I end up
transforming the result to zero, which is considerably easier to
present/absorb alongside other valid fractions in a table or chart.

David J.


Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Andrew Dunstan


On 3/1/19 2:14 PM, Tom Lane wrote:
> Robert Haas  writes:
>> I want to make one other point about this patch, which is that over on
>> the thread "New vacuum option to do only freezing" we have a patch
>> that does a closely-related thing.  Both patches skip one phase of the
>> overall VACUUM process.  THIS patch wants to skip truncation; THAT
>> patch wants to skip index cleanup.  Over there, we seem to have
>> settled on DISABLE_INDEX_CLEANUP -- only available as a VACUUM option
>> -- and here I think the proposal is currently VACUUM_SHRINK_ENABLED --
>> only available as a reloption.
>> Now that seems not very consistent.
> Indeed, but I'm not sure that the use-cases are the same.  In particular,
> unless somebody has done some rather impossible magic, it would be
> disastrous to apply DISABLE_INDEX_CLEANUP as a reloption, because then
> it would be persistent and you'd never get a real vacuum operation and
> soon your disk would be full.  Permanently applying truncation disabling
> seems less insane.
>
> The gratuitously inconsistent spellings should be harmonized, for sure.
>
>   


You could allow an explicitly set command option to override the reloption.


It's important for us to be able to control the vacuum phases more. In
particular, the index cleanup phase can have significant system impact
but often doesn't need to be done immediately.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Tom Lane
Andres Freund  writes:
> On 2019-03-01 14:17:33 -0500, Tom Lane wrote:
>> I think we should reject the whole patch, tbh, and go do something
>> about the underlying problem instead.  Once we've made truncation
>> not require AEL, this will be nothing but a legacy wart that we'll
>> have a hard time getting rid of.

> IDK, it's really painful in the field, and I'm not quite seeing us
> getting rid of the AEL for v12.

Dunno, I was musing about it just yesterday, in
https://postgr.es/m/1261.1551392...@sss.pgh.pa.us

I'd sure rather spend time making that happen than this.  I'm also
not entirely convinced that we MUST do something about this in v12
rather than v13 --- we've been living with it ever since we had
in-core replication, why's it suddenly so critical?

> I think it's a wart, but one that works
> around a pretty important usability issue. And I think we should just
> remove the GUC without any sort of deprecation afterwards, if necessary
> we can add a note to the docs to that effect.  It's not like preventing
> truncation from happening is a very intrusive / dangerous thing to do.

Well, if we add a reloption then we can never ever get rid of it; at
best we could ignore it.  So from the perspective of how-fast-can-we-
deprecate-this, maybe a GUC is the better answer.  On the other hand,
I'm not sure I believe that many installations could afford to disable
truncation for every single table.

regards, tom lane



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Tom Lane
Andrew Dunstan  writes:
> On 3/1/19 2:14 PM, Tom Lane wrote:
>> Indeed, but I'm not sure that the use-cases are the same.  In particular,
>> unless somebody has done some rather impossible magic, it would be
>> disastrous to apply DISABLE_INDEX_CLEANUP as a reloption, because then
>> it would be persistent and you'd never get a real vacuum operation and
>> soon your disk would be full.  Permanently applying truncation disabling
>> seems less insane.

> You could allow an explicitly set command option to override the reloption.
> It's important for us to be able to control the vacuum phases more. In
> particular, the index cleanup phase can have significant system impact
> but often doesn't need to be done immediately.

I'm not objecting to having a manual command option to skip index cleanup
(which basically reduces to "do nothing but tuple freezing", right?
maybe it should be named/documented that way).  Applying it as a reloption
seems like a foot-gun, though.

regards, tom lane



Re: Why don't we have a small reserved OID range for patch revisions?

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 6:40 PM Tom Lane  wrote:
> 1. Encourage people to develop new patches using chosen-at-random
> high OIDs, in the 7K-9K range.  They do this already, it'd just
> be encouraged instead of discouraged.
>
> 2. Commit patches as received.
>
> 3. Once each devel cycle, after feature freeze, somebody uses the
> renumbering tool to shove all the new OIDs down to lower numbers,
> freeing the high-OID range for the next devel cycle.  We'd have
> to remember to do that, but it could be added to the RELEASE_CHANGES
> checklist.

Sure, that sounds nice.  It seems like it might be slightly less
convenient for non-committers than what I was proposing, but still
more convenient than what they're doing right now.  And it's also more
convenient for committers, because they're not being asked to manually
fiddle patches at the last moment, something that I at least find
rather error-prone.  It also, and I think this is really good, moves
in the direction of fewer things for both patch authors and patch
committers to worry about doing wrong.  Instead of throwing rocks at
people whose OID assignments are "wrong," we just accept what people
do and adjust it later if it makes sense to do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why don't we have a small reserved OID range for patch revisions?

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 5:36 PM Peter Geoghegan  wrote:
> I have attempted to institute some general guidelines for what the
> thicket of rules are by creating the "committing checklist" page. This
> is necessarily imperfect, because the rules are in many cases open to
> interpretation, often for good practical reasons. I don't have any
> sympathy for committers that find it hard to remember to do a
> catversion bump with any kind of regularity. That complexity seems
> inherent, not incidental, since it's often convenient to ignore
> catalog incompatibilities during development.

Bumping catversion is something of a special case, because one does it
so often that one gets used to remembering it.  The rules are usually
not too hard to remember, although they are trickier when you don't
directly change anything in src/include/catalog but just change the
definition of some node that may be serialized in a catalog someplace.
It would be neat if there were a tool you could run to somehow tell
you whether catversion needs to be changed for a given patch.

But you know there are a lot of other version numbers floating around,
like XLOG_PAGE_MAGIC or the pg_dump archive version, and it is not
really that easy to know -- as a new contributor or sometimes even as
an experienced one -- whether your work requires any changes to that
stuff, or even that that stuff *exists*.  Indeed, XLOG_PAGE_MAGIC is a
particularly annoying case, both because the constant name doesn't
contain VERSION and because the comment just says /* can be used as
WAL version indicator */ which does not exactly make it clear that if
you fail to bump it when you touch the WAL format you will Make People
Unhappy.

Indeed, Simon got complaints a number of years ago (2010, it looks
like) when he had the temerity to change the magic number to some
other unrelated value instead of just incrementing it by one.
Although I think that the criticism was to a certain extent
well-founded -- why deviate from previous practice? -- there is at the
same time something a little crazy about somebody getting excited
about the particular value that has been chosen for a number that is
described in the very name of the constant as a MAGIC number.  And
especially because there is absolutely zip in the way of code comments
or a README that explain to you how to do it "right."

> > So, I suspect that for every
> > unit of work it saves somebody, it's probably going to generate about
> > one unit of extra work for somebody else.
>
> Maybe so. I think that you're jumping to conclusions, though.

I did say "I suspect..." which was intended as a concession that I
don't know for sure.

> But you seem to want to make the mechanism itself even more
> complicated, not less complicated (based on your remarks about making
> OID assignment happen during the build). In order to make the use of
> the mechanism easier. That seems worth considering, but ISTM that this
> is talking at cross purposes. There are far simpler ways of making it
> unlikely that a committer is going to miss this step. There is also a
> simple way of noticing that they do quickly (e.g. a simple buildfarm
> test).

Well, perhaps I'm proposing some additional code, but I don't think of
that as making the mechanism more complicated.  I want to make it
simpler for patch submitters and reviewers and committers to not make
mistakes that they have to run around and fix.  If there are fewer
kinds of things that qualify as mistakes, as in Tom's latest proposal,
then we are moving in the right direction IMO regardless of anything
else.

> > OK.  Well, I think that doing nothing is superior to this proposal,
> > for reasons similar to what Peter Eisentraut has already articulated.
> > And I think rather than blasting forward with your own preferred
> > alternative in the face of disagreement, you should be willing to
> > discuss other possible options.  But if you're not willing to do that,
> > I can't make you.
>
> Peter seemed to not want to do this on the grounds that it isn't
> necessary at all, whereas you think that it doesn't go far enough. If
> there is a consensus against what Tom has said, it's a cacophonous one
> that cannot really be said to be in favor of anything.

I think Peter and I are more agreeing than we are at the opposite ends
of a spectrum, but more importantly, I think it is worth having a
discussion first about what people like and dislike, and what goals
they have, and then only if necessary, counting the votes afterwards.
I don't like having the feeling that because I have a different view
of something and want to write an email about that, I am somehow an
impediment to progress.  I think if we reduce discussions to
you're-for-it-or-you're-against-it, that's not that helpful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Refactoring the checkpointer's fsync request queue

2019-03-01 Thread Thomas Munro
On Sat, Mar 2, 2019 at 8:36 AM Shawn Debnath  wrote:
> On Fri, Mar 01, 2019 at 01:15:21PM -0500, Robert Haas wrote:
> > > >
> > > > +typedef enum syncrequestowner
> > > > +{
> > > > +   SYNC_MD = 0 /* md smgr */
> > > > +} syncrequestowner;
> > > >
> > > > I have a feeling that Andres wanted to see a single enum combining
> > > > both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD,
> > > > SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you
> > > > have it.
> > >
> > > Obviously it's nicer looking this way, but OTOH, that means we have to
> > > send more data over the queue, because we can't easily combine the
> > > request + "owner". I don't have too strong feelings about it though.
> >
> > Yeah, I would lean toward combining those.
>
> I disagree, at least with combining and retaining enums. Encoding all
> the possible request types with the current, planned and future SMGRs
> would cause a sheer explosion in the number of enum  values. Not to
> mention that you have multiple enum values for the same behavior - which
> just isn't clean. And handling of these enums in the code would be ugly
> too.
>
> Do note that these are typedef'ed to uint8 currently. For a default
> config with 128 MB shared_buffers, we will use an extra 16kB (one extra
> byte to represent the owner).  I am hesitant to change this right now
> unless folks feel strongly about it.
>
> If so, I would combine the type and owner by splitting it up in 4 bit
> chunks, allowing for 16 request types and 16 smgrs. This change would
> only apply for the in-memory queue. The code and functions would
> continue using the enums.

+1

-- 
Thomas Munro
https://enterprisedb.com



Re: Infinity vs Error for division by zero

2019-03-01 Thread Chapman Flack
On 3/1/19 2:26 PM, David G. Johnston wrote:

> Upon further reading you are correct - IEEE 754 has chosen to treat n/0
> differently for n=0 and n<>0 cases.  I'm sure they have their reasons but
> ... I don't use,
> or have time for the distraction, to understand why such a decision was
> made and how it could be useful.

The answer may be as simple as the inherent difference between
the cases.

0/0 is funny because of a uniqueness problem. Try to name q such that
0/0 = q, rewritten as q × 0 = 0, and the problem you run into is that
that's true for any value of q. So you would have to make some
completely arbitrary decision to name any value at all as "the" result.

(anything nonzero)/0 is funny because of a representability problem.
n/0 = q, rewritten as q × 0 = n, only has the problem that it's
untrue for every finite value q; they're never big enough. Calling the
result infinity is again a definitional decision, but this time it
is not an arbitrary one among multiple equally good choices; all
finite choices are ruled out, and the definitional choice is fully
consistent with what you see happening as a divisor *approaches* zero.
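
For what it's worth, the two cases are easy to see with any IEEE 754
double -- a standalone C illustration, nothing PostgreSQL-specific:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    volatile double zero = 0.0;     /* volatile just defeats constant folding */

    printf("%g\n", 1.0 / zero);         /* inf:  no finite q satisfies q * 0 = 1 */
    printf("%g\n", -1.0 / zero);        /* -inf */
    printf("%g\n", zero / zero);        /* nan:  every q satisfies q * 0 = 0 */
    printf("%d\n", isnan(zero / zero)); /* 1 */
    return 0;
}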

> For my part in the queries I have that encounter divide-by-zero I end up
> transforming the result to zero which is considerably easier to
> present

Easy to present it may be, but it lacks the mathematical
motivation behind the choice IEEE made ... as a value for q, zero
fails the q × 0 = n test fairly convincingly for nonzero n. :)

Regards,
-Chap



Re: Refactoring the checkpointer's fsync request queue

2019-03-01 Thread Robert Haas
On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath  wrote:
> I disagree, at least with combining and retaining enums. Encoding all
> the possible request types with the current, planned and future SMGRs
> would cause a sheer explosion in the number of enum  values.

How big of an explosion would it be?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-01 Thread Robert Haas
On Thu, Feb 7, 2019 at 3:28 AM Masahiko Sawada  wrote:
> WAL encryption will follow as an additional feature.

I don't think WAL encryption is an optional feature.  You can argue
about whether it's useful to encrypt the disk files in the first place
given that there's no privilege boundary between the OS user and the
database, but a lot of people seem to think it is and maybe they're
right.  However, who can justify encrypting only SOME of the disk
files and not others?  I mean, maybe you could justify not encryption
the SLRU files on the grounds that they probably don't leak much in
the way of interesting information, but the WAL files certainly do --
your data is there, just as much as in the data files themselves.

To be honest, I think there is a lot to like about the patches
Cybertec has proposed.  Those patches don't have all of the fancy
key-management stuff that you are proposing here, but maybe that
stuff, if we want it, could be added, rather than starting over from
scratch.  It seems to me that those patches get a lot of things right.
In particular, it looked to me when I looked at them like they made a
pretty determined effort to encrypt every byte that might go down to
the disk.  It seems to me that you if you want encryption, you want
that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why don't we have a small reserved OID range for patch revisions?

2019-03-01 Thread Peter Geoghegan
On Fri, Mar 1, 2019 at 11:56 AM Robert Haas  wrote:
> It would be neat if there were a tool you could run to somehow tell
> you whether catversion needs to be changed for a given patch.

That seems infeasible because of stored rules. A lot of things bleed
into that. We could certainly do better at documenting this on the
"committing checklist" page, though.

> Indeed, Simon got complaints a number of years ago (2010, it looks
> like) when he had the temerity to change the magic number to some
> other unrelated value instead of just incrementing it by one.

Off with his head!

> Although I think that the criticism was to a certain extent
> well-founded -- why deviate from previous practice? -- there is at the
> same time something a little crazy about somebody getting excited
> about the particular value that has been chosen for a number that is
> described in the very name of the constant as a MAGIC number.  And
> especially because there is absolutely zip in the way of code comments
> or a README that explain to you how to do it "right."

I have learned to avoid ambiguity more than anything else, because
ambiguity causes patches to flounder indefinitely, whereas it's
usually not that hard to fix something that's broken. I agree -
anything that adds ambiguity rather than taking it away is a big
problem.

> Well, perhaps I'm proposing some additional code, but I don't think of
> that as making the mechanism more complicated.  I want to make it
> simpler for patch submitters and reviewers and committers to not make
> mistakes that they have to run around and fix.

Right. So do I. I just don't think that it's that bad to ask the final
committer to do something once, rather than getting everyone else
(including committers) to do it multiple times. If we can avoid even
this burden, and totally centralize the management of the OID space,
then so much the better.

> If there are fewer
> kinds of things that qualify as mistakes, as in Tom's latest proposal,
> then we are moving in the right direction IMO regardless of anything
> else.

I'm glad that we now have a plan that is a clear step forward.

> I think Peter and I are more agreeing than we are at the opposite ends
> of a spectrum, but more importantly, I think it is worth having a
> discussion first about what people like and dislike, and what goals
> they have, and then only if necessary, counting the votes afterwards.

I agree that that's totally worthwhile.

> I don't like having the feeling that because I have a different view
> of something and want to write an email about that, I am somehow an
> impediment to progress.  I think if we reduce discussions to
> you're-for-it-or-your-against-it, that's not that helpful.

That was not my intention. The way that you brought the issue of the
difficulty of being a contributor in general into it was unhelpful,
though. It didn't seem useful or fair to link Tom's position to a big,
well known controversy.

We now have a solution that everyone is happy with, or can at least
live with, which suggests to me that Tom wasn't being intransigent or
insensitive to the concerns of contributors.

-- 
Peter Geoghegan



Re: Refactoring the checkpointer's fsync request queue

2019-03-01 Thread Thomas Munro
On Sat, Mar 2, 2019 at 9:35 AM Shawn Debnath  wrote:
> On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote:
> > On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath  wrote:
> > > I disagree, at least with combining and retaining enums. Encoding all
> > > the possible request types with the current, planned and future SMGRs
> > > would cause a sheer explosion in the number of enum  values.
> >
> > How big of an explosion would it be?
>
> 4 enum values x # of smgrs; currently md, soon undo and slru so 12 in
> total. Any future smgr addition will expand this further.

It's not so much the "explosion" that bothers me.  I think we should
have a distinct sync requester enum, because we need a way to index
into the table of callbacks.  How exactly you pack the two enums into
compact space seems like a separate question; doing it with two words
would obviously be wasteful, but it should be possible stuff them into
(say) a single uint8_t, uint16_t or whatever will pack nicely in the
request struct and allow the full range of request types (4?) + the
full range of sync requesters (which we propose to expand to 3 in the
forseeable future).  Now perhaps the single enum idea was going to
involve explicit values that encode the two values SYNC_REQ_CANCEL_MD
= 0x1 | (0x04 << 4) so you could still extract the requester part, but
that's just the same thing with uglier code.

-- 
Thomas Munro
https://enterprisedb.com
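
For what it's worth, a self-contained sketch of the nibble-packing idea
being discussed; the request-type names and the extra owners are
illustrative, not the patch's actual identifiers:

#include <stdint.h>

typedef enum syncrequesttype
{
    SYNC_REQUEST = 0,           /* illustrative names only */
    SYNC_UNLINK_REQUEST,
    SYNC_FORGET_REQUEST,
    SYNC_FILTER_REQUEST
} syncrequesttype;

typedef enum syncrequestowner
{
    SYNC_MD = 0,                /* md smgr */
    SYNC_UNDO,                  /* hypothetical future owners */
    SYNC_SLRU
} syncrequestowner;

/*
 * Pack both values into one byte for the in-memory queue: low nibble is
 * the request type, high nibble is the owner, so 16 of each are possible.
 */
static inline uint8_t
pack_sync_request(syncrequesttype type, syncrequestowner owner)
{
    return (uint8_t) (((unsigned) owner << 4) | ((unsigned) type & 0x0F));
}

static inline syncrequesttype
sync_request_type(uint8_t packed)
{
    return (syncrequesttype) (packed & 0x0F);
}

static inline syncrequestowner
sync_request_owner(uint8_t packed)
{
    return (syncrequestowner) (packed >> 4);
}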



Re: Infinity vs Error for division by zero

2019-03-01 Thread Matt Pulver
On Fri, Mar 1, 2019 at 12:59 PM Andrew Gierth 
wrote:

> > "Matt" == Matt Pulver  writes:
>
>  Matt> ERROR:  division by zero
>
>  Matt> Question: If Infinity and NaN are supported, then why throw an
>  Matt> exception here, instead of returning Infinity?
>
> Spec says so:
>
>   4) The dyadic arithmetic operators <plus sign>, <minus sign>,
>  <asterisk>, and <solidus> (+, -, *, and /, respectively) specify
>  addition, subtraction, multiplication, and division, respectively.
>  If the value of a divisor is zero, then an exception condition is
>  raised: data exception -- division by zero.


Thank you, that is what I was looking for. In case anyone else is looking
for source documentation on the standard, there is a link from
https://en.wikipedia.org/wiki/SQL:2003#Documentation_availability to a zip
file of the SQL 2003 draft http://www.wiscorp.com/sql_2003_standard.zip
where one can confirm this (page 242 of 5WD-02-Foundation-2003-09.pdf).


On Fri, Mar 1, 2019 at 2:26 PM David G. Johnston 
wrote:

> On Friday, March 1, 2019, Chapman Flack  wrote:
>
>>
>> But if someone wanted to write a user-defined division function or
>> operator that would return Inf for (anything > 0) / 0 and for
>> (anything < 0) / -0, and -Inf for (anything < 0) / 0 and for
>> (anything > 0) / -0, and NaN for (either zero) / (either zero), I think
>> that function or operator would be fully in keeping with IEEE 754.
>>
>
> Upon further reading you are correct - IEEE 754 has chosen to treat n/0
> differently for n=0 and n<>0 cases.  I'm sure they have their reasons but
> within the scope of this database, and the core arithmetic functions it
> provides, those distinctions don't seeming meaningful and having to add
> query logic to deal with both cases would just be annoying.  I don't use,
> or have time for the distraction, to understand why such a decision was
> made and how it could be useful.  Going from an exception to NaN makes
> sense to me, going instead to infinity - outside of limit expressions which
> aren't applicable here - does not.
>
> For my part in the queries I have that encounter divide-by-zero I end up
> transforming the result to zero which is considerably easier to
> present/absorb along side other valid fractions in a table or chart.
>

In heavy financial/scientific calculations with tables of data, using inf
and nan is very useful, much more so than alternatives such as throwing an
exception (which row(s) included the error?), or replacing them with NULL
or 0. There are many intermediate values where using inf makes sense and
results in finite outcomes at the appropriate limit: atan(1.0/0)=pi/2,
erf(1.0/0)=1, exp(-1.0/0)=0, etc.

In contrast, nan represents a mathematically indeterminate form, in which
the appropriate limit could not be ascertained. E.g. 0.0/0, inf-inf,
0.0*inf, etc. In many applications, I would much rather see calculations
carried out via IEEE 754 all the way to the end, with nans and infs, which
provides much more useful diagnostic information than an exception that
doesn't return any rows at all. As Andres Freund pointed out, it is also
more expensive to do the intermediate checks. Just let IEEE 754 do its
thing! (More directed at the SQL standard than to PostgreSQL.)
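
For what it's worth, those limits are exactly what the C math library
produces once an infinity has been introduced; a standalone illustration,
not PostgreSQL code:

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double inf = INFINITY;      /* same value as 1.0/0.0 under IEEE 754 */

    printf("atan(inf) = %.17g\n", atan(inf));   /* pi/2 */
    printf("erf(inf)  = %g\n", erf(inf));       /* 1 */
    printf("exp(-inf) = %g\n", exp(-inf));      /* 0 */
    printf("inf - inf = %g\n", inf - inf);      /* nan: indeterminate form */
    return 0;
}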

Best regards,
Matt


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-01 Thread Haribabu Kommi
On Sat, Mar 2, 2019 at 7:27 AM Robert Haas  wrote:

> On Thu, Feb 7, 2019 at 3:28 AM Masahiko Sawada 
> wrote:
> > WAL encryption will follow as an additional feature.
>
> I don't think WAL encryption is an optional feature.  You can argue
> about whether it's useful to encrypt the disk files in the first place
> given that there's no privilege boundary between the OS user and the
> database, but a lot of people seem to think it is and maybe they're
> right.  However, who can justify encrypting only SOME of the disk
> files and not others?  I mean, maybe you could justify not encryption
> the SLRU files on the grounds that they probably don't leak much in
> the way of interesting information, but the WAL files certainly do --
> your data is there, just as much as in the data files themselves.
>

+1

WAL encryption is not optional; the WAL must be encrypted.



> To be honest, I think there is a lot to like about the patches
> Cybertec has proposed.  Those patches don't have all of the fancy
> key-management stuff that you are proposing here, but maybe that
> stuff, if we want it, could be added, rather than starting over from
> scratch.  It seems to me that those patches get a lot of things right.
> In particular, it looked to me when I looked at them like they made a
> pretty determined effort to encrypt every byte that might go down to
> the disk.  It seems to me that you if you want encryption, you want
> that.
>


The Cybertec proposed patches do the encryption at the instance
level; AFAIK, the current discussion is also trying to reduce the scope of
the encryption to the object level (tablespace, database or table) to avoid
the encryption performance impact for the databases and tables that don't
need it.

IMO, the entire performance of encryption depends on WAL encryption, whether
we choose instance level or object level encryption.

As WAL encryption is not optional, even if the encryption is set at the
object level, the corresponding objects' WAL needs to be encrypted, so this
should be done at WAL insertion, not at WAL write to disk, because otherwise
some entries would be encrypted and some not. Or maybe something like
encrypting the entire WAL even if only one object is set for encryption.


Regards,
Haribabu Kommi
Fujitsu Australia


Re: "WIP: Data at rest encryption" patch and, PostgreSQL 11-beta3

2019-03-01 Thread Robert Haas
On Fri, Sep 14, 2018 at 10:04 AM Antonin Houska  wrote:
> Toshi Harada  wrote:
> > Even if you apply "data-at-rest-encryption-wip-2018.07.25.patch" to the 
> > master branch,
> > the same patch error will occur.
>
> The version attached to this email should be applicable to the current
> branch. Sorry for the delay.

I have a few comments on this patch.  It doesn't seem useful to go
through and make really detailed review comments on everything at this
stage, because it's pretty far from being committable, at least IMHO.
However, I have a few general thoughts.

First of all, this is an impressive piece of work.  It's obvious even
from looking at this quickly that a lot of energy has gone into making
this work, and it touches a lot of stuff, not without reason.  For
example, it's quite impressive that this touches things like the
write-ahead log, logical decoding, and the stats collector, all of
which write to disk.

However, this is also quite invasive.  It changes a lot of code and it
doesn't do so in a very predictable way.  It's not like you went
through and replaced every call to write() with a call to
SpecialEncryptionMagicWrite().  Rather, there are new #ifdef
USE_ENCRYPTION blocks all over the code base.  I think it will be
important to look for ways to refactor this functionality to reduce
that sort of thing to the absolute minimum possible.  Encryption needs
to be something that the code needs to know about here and there, but
not everywhere.
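
To illustrate the kind of centralization meant here -- purely a
hypothetical sketch, not code from either patch set -- the decision could
be funneled through one wrapper so that callers never test data_encrypted
themselves (encrypt_block() is an assumed helper, and the file offset is
used to derive the tweak):

static ssize_t
encrypted_write(int fd, const char *buf, size_t len, off_t offset)
{
#ifdef USE_ENCRYPTION
    char        scratch[BLCKSZ];

    if (data_encrypted)
    {
        Assert(len <= BLCKSZ);
        encrypt_block(buf, scratch, len, offset);
        return pwrite(fd, scratch, len, offset);
    }
#endif
    return pwrite(fd, buf, len, offset);
}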

Some of the changes are things that could perhaps be done as separate,
preparatory patches.  For instance, pgstat.c gets a heavy refactoring
to make it do I/O in bigger chunks.  There's no reason that you
couldn't separate some of those changes out, clean them up, and get
them committed separately.  It would be more efficient even as things
stand.  OK, well, there is one reason: we may be getting a
shared-memory stats collector soon, and if we do, then the need for
all of this will go away.  But the general point is valid: whatever is
a useful cleanup apart from this patch should be done separately from
this patch.

As far as possible, we need to try to do things in the same way in the
encrypted and non-encrypted cases.  For example, it's pretty hard to
feel good about code like this:

+ sz_hdr = sizeof(ReorderBufferDiskChange);
+ if (data_encrypted)
+#ifdef USE_ENCRYPTION
+ sz_hdr = TYPEALIGN(ENCRYPTION_BLOCK, sz_hdr);
+#else
+ ENCRYPTION_NOT_SUPPORTED_MSG;
+#endif /* USE_ENCRYPTION */

You won't be able to have much confidence that this code works both
with and without encryption unless you test it both ways, because the
file format is different in those two cases, and not just by being
encrypted.  That means that anyone who modifies this code in the
future has two ways of breaking it.  They can break the normal case,
or they can break the encrypted case.  If encryption were available as
a sort of service that everyone could use, then that would probably be
fine, but like I said above, I think something as invasive as this
currently is will lead to a lot of complaints.

The code needs a lot more documentation, not only SGML documentation
but also code comments and probably a README file someplace.  There is
a lot of discussion in the comments here of encryption tweaks, but
there's no general introduction to the topic (what's a tweak?) or --
and I think this is important -- what the idea behind the choice of
various tweaks in different places actually is.  There's probably some
motivating philosophy behind the way the tweaks are chosen that should
be explained someplace.  Think about it this way: if some hapless
developer, let's say me, wants to add a new module to PostgreSQL which
happens to write data to disk, that person needs an easy way to
understand what they need to do to make that new code play nice with
encryption.  Right now, the code for every existing module is a little
different (uggh) and there's this encryption tweak stuff
about which you have to do something intelligent, but there's nothing
that tells you how to be intelligent about it, at least not that I
see.  People are going to need something that looks more like a
formula that they can just copy, and a good introduction to the
principles in some appropriate place, too.

The interaction of this capability with certain tricks that PostgreSQL
plays needs some thought -- and documentation, at least developer
documentation, maybe user documentation.  One area is recovery.
Writing WAL relies on the fact that if you are in the midst of
rewriting a block and the power fails, bytes that are the same in both
the old and new block images will be undisturbed; encryption will make
that false.  How's that going to work?  Hint bits also rely on this
principle.  I thought there might be some interaction between this
work and wal_log_hints for this reason, but I see nothing of that sort
in the patch.  Full page writes seem related too; I don't know how
something like this can be safe without bo

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-01 Thread Robert Haas
On Fri, Mar 1, 2019 at 3:52 PM Haribabu Kommi  wrote:
> The Cybertec proposed patches are doing the encryption at the instance
> level, AFAIK, the current discussion is also trying to reduce the scope of the
> encryption to object level like (tablesapce, database or table) to avoid the 
> encryption
> performance impact for the databases, tables that don't need it.

The trick there is that it becomes difficult to figure out which keys
to use for certain things.  For example, you could say, well, this WAL
record is for a table that is encrypted with key 123, so let's use key
123 to encrypt the WAL record also.  So far, so good.  But then how do
you encrypt, say, a logical decoding spill file?  That could have data
in it mixed together from multiple relations, IIUC.  Or what do you do
about SLRUs or other global structures?  If you just exclude that
stuff from the scope of encryption, then you aren't helping the people
who want to Just Encrypt Everything.

Now that having been said I bet a lot of people would find it pretty
cool if we could make this work on a per-table basis.  And I'm not
opposed to that.  I just think it's really hard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Refactoring the checkpointer's fsync request queue

2019-03-01 Thread Robert Haas
On Fri, Mar 1, 2019 at 3:35 PM Shawn Debnath  wrote:
> On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote:
> > On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath  wrote:
> > > I disagree, at least with combining and retaining enums. Encoding all
> > > the possible request types with the current, planned and future SMGRs
> > > would cause a sheer explosion in the number of enum  values.
> >
> > How big of an explosion would it be?
>
> 4 enum values x # of smgrs; currently md, soon undo and slru so 12 in
> total. Any future smgr addition will expand this further.

I thought the idea was that each smgr might have a different set of
requests.  If they're all going to have the same set of requests then
I agree with you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Infinity vs Error for division by zero

2019-03-01 Thread Chapman Flack
On 3/1/19 3:49 PM, Matt Pulver wrote:

> In many applications, I would much rather see calculations carried out
> via IEEE 754 all the way to the end, with nans and infs, which
> provides much more useful diagnostic information than an exception that
> doesn't return any rows at all. As Andres Freund pointed out, it is also
> more expensive to do the intermediate checks. Just let IEEE 754 do its
> thing! (More directed at the SQL standard than to PostgreSQL.)

I wanted to try this out a little before assuming it would work,
and there seems to be no trouble creating a trivial domain over
float8 (say, CREATE DOMAIN ieeedouble AS float8), and then creating
operators whose operand types are the domain type.

So it seems an extension could easily do that, and supply happily
inf-returning and NaN-returning versions of the operators and
functions, and those will be used whenever operands have the domain
type.

It might even be useful and relatively elegant, while leaving the
SQL-specified base types to have the SQL-specified behavior.

-Chap



Re: Drop type "smgr"?

2019-03-01 Thread Thomas Munro
On Fri, Mar 1, 2019 at 9:11 PM Konstantin Knizhnik
 wrote:
> One more thing... From my point of view one of the drawbacks of Postgres
> is that it requires an underlying file system and is not able to work with
> raw partitions.
> It seems to me that bypassing the file system layer could significantly
> improve performance and give more possibilities for IO performance tuning.
> Certainly it would require a lot of changes in the Postgres storage layer,
> so this is not something I suggest implementing or even discussing right now.
> But it may be useful to keep it in mind in discussions concerning a
> "generic storage manager".

Hmm.  Speculation-around-the-water-cooler-mode:  I think the arguments
for using raw partitions are approximately the same as the arguments
for using a big data file that holds many relations.  The three I can
think of are (1) the space is entirely preallocated, which *might*
have performance and safety advantages, (2) if you encrypt it, no one
can see the structure (database OIDs, relation OIDs and sizes) and (3)
it might allow pages to be moved from one relation to another without
copying or interleaved in interesting ways (think SQL Server parallel
btree creation that "stitches together" btrees produced by parallel
workers, or Oracle "clustered" tables where the pages of two tables
are physically interleaved in an order that works nicely when you join
those two tables, or perhaps schemes for moving data between
relations/partitions quickly).  On the other hand, to make that work
you have to own the problem of space allocation/management that we
currently leave to the authors of XFS et al, and those guys have
worked on that for *years and years* and they work really well.  If
you made all that work for big preallocated data files, then sure, you
could also make it work for raw partitions, but I'm not sure how much
performance advantage there is for that final step.  I suspect that a
major reason for Oracle to support raw block devices many years ago
was because back then there was no other way to escape from the page
cache.  Direct IO hasn't always been available or portable, and hasn't
always worked well.  That said, it does seem plausible that we could
do the separation of (1) block -> pathname/offset mappings and (2)
actual IO operations in a way that you could potentially write your
own pseudo-filesystem that stores a whole PostgreSQL cluster inside
big data files or raw partitions.  Luckily we don't need to tackle
such mountainous terrain to avoid the page cache, today.

-- 
Thomas Munro
https://enterprisedb.com



Re: Tighten error control for OpenTransientFile/CloseTransientFile

2019-03-01 Thread Joe Conway
On 2/28/19 9:33 PM, Michael Paquier wrote:
> Hi all,
> 
> Joe's message here has reminded me that we have lacked a lot of error
> handling around CloseTransientFile():
> https://www.postgresql.org/message-id/c49b69ec-e2f7-ff33-4f17-0eaa4f2ce...@joeconway.com
> 
> This has been mentioned by Alvaro a couple of months ago (cannot find
> the thread about that at quick glance), and I just forgot about it at
> that time.  Anyway, attached is a patch to do some cleanup for all
> that:
> - Switch OpenTransientFile to read-only where sufficient.
> - Add more error handling for CloseTransientFile
> A major take of this patch is to make sure that the new error messages
> generated have an elevel consistent with their neighbors.
> 
> Just on time for this last CF.  Thoughts?

Seems like it would be better to modify the arguments to
CloseTransientFile() to include the filename being closed, errorlevel,
and fail_on_error or something similar. Then all the repeated ereport
stanzas could be eliminated.
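
Something along these lines, for instance; a hypothetical wrapper sketched
from the suggestion above, not the actual API:

/*
 * Close the fd and do the error reporting here, so callers don't repeat
 * the same ereport() stanza everywhere.
 */
static bool
CloseTransientFileReport(int fd, const char *filename, int elevel)
{
    if (CloseTransientFile(fd) != 0)
    {
        ereport(elevel,
                (errcode_for_file_access(),
                 errmsg("could not close file \"%s\": %m", filename)));
        return false;
    }
    return true;
}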

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: Infinity vs Error for division by zero

2019-03-01 Thread Matt Pulver
On Fri, Mar 1, 2019 at 4:51 PM Chapman Flack  wrote:

> On 3/1/19 3:49 PM, Matt Pulver wrote:
>
> > In many applications, I would much rather see calculations carried out
> > via IEEE 754 all the way to the end, with nans and infs, which
> > provides much more useful diagnostic information than an exception that
> > doesn't return any rows at all. As Andres Freund pointed out, it is also
> > more expensive to do the intermediate checks. Just let IEEE 754 do its
> > thing! (More directed at the SQL standard than to PostgreSQL.)
>
> I wanted to try this out a little before assuming it would work,
> and there seems to be no trouble creating a trivial domain over
> float8 (say, CREATE DOMAIN ieeedouble AS float8), and then creating
> operators whose operand types are the domain type.
>
> So it seems an extension could easily do that, and supply happily
> inf-returning and NaN-returning versions of the operators and
> functions, and those will be used whenever operands have the domain
> type.
>
> It might even be useful and relatively elegant, while leaving the
> SQL-specified base types to have the SQL-specified behavior.
>

That would be very useful. I've been wanting this for years, and I'm sure
the data users I work with will appreciate it (but don't directly
understand this to be the solution).

There are issues relating to ordering and aggregation that perhaps are
already transparent to you, but I'll mention anyway for the record.
Conceptually, there would be different contexts of ordering:

   1. When writing mathematical functions, <, =, and > are all false when
   comparing to NaN (NaN != NaN is true.)
   2. In SQL when sorting or aggregating, NaN=NaN. Consider that there are
   2^53-2 different double precision representations of NaN at the bit level.
   Under the same floating point ordering logic used for finite numbers, when
   applied to inf and nan, we get the following ordering: -nan < -inf < (all
   finite numbers) < inf < nan. When the bit patterns are taken into
   consideration, an efficient sort algorithm can be implemented. (Forgive me
   for stating the obvious, but just mentioning this for whoever is going to
   take this on.)

I would be most interested to hear of and discuss any other unforeseen
complications or side-effects.

Best regards,
Matt
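
For what it's worth, the bit-level ordering mentioned in point 2 can be
obtained with the usual sign-flipping trick; a standalone C sketch, not
proposed code:

#include <stdint.h>
#include <string.h>

/*
 * Map a double onto a uint64 whose unsigned order is
 *   -nan < -inf < negative finites < -0 < +0 < positive finites < inf < nan
 * Negative bit patterns are inverted; non-negative ones just get the sign
 * bit set, so comparing the keys needs no special cases.
 */
static uint64_t
float8_sort_key(double d)
{
    uint64_t    bits;

    memcpy(&bits, &d, sizeof(bits));
    if (bits & UINT64_C(0x8000000000000000))
        return ~bits;
    return bits | UINT64_C(0x8000000000000000);
}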


Re: NOT IN subquery optimization

2019-03-01 Thread David Rowley
On Sat, 2 Mar 2019 at 05:44, Tom Lane  wrote:
>
> Andres Freund  writes:
> > I've not checked, but could we please make sure these cases are covered
> > in the regression tests today with a single liner?
>
> I'm not sure if the second one is actually a semantics bug or just a
> misoptimization?  But yeah, +1 for putting in some simple tests for
> corner cases right now.  Anyone want to propose a specific patch?

The second is just reducing the planner's flexibility to produce a
good plan.  The first is a bug. Proposed regression test attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/test/regress/expected/subselect.out b/src/test/regress/expected/subselect.out
index cc3f5f3737..1b09b3e3fd 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -830,6 +830,20 @@ explain (verbose, costs off)
  One-Time Filter: ("*VALUES*".column1 = "*VALUES*".column1)
 (8 rows)
 
+--
+-- Check we don't filter NULL outer rows in a NOT IN where the subquery
+-- returns no rows.
+--
+create temp table notinouter (a int);
+create temp table notininner (a int not null);
+insert into notinouter values(null),(1);
+select * from notinouter where a not in(select a from notininner);
+ a 
+---
+  
+ 1
+(2 rows)
+
 --
 -- Check we behave sanely in corner case of empty SELECT list (bug #8648)
 --
diff --git a/src/test/regress/sql/subselect.sql b/src/test/regress/sql/subselect.sql
index 8bca1f5d55..48230d4671 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -466,6 +466,16 @@ explain (verbose, costs off)
   select x, x from
 (select (select random() where y=y) as x from (values(1),(2)) v(y)) ss;
 
+--
+-- Check we don't filter NULL outer rows in a NOT IN where the subquery
+-- returns no rows.
+--
+create temp table notinouter (a int);
+create temp table notininner (a int not null);
+insert into notinouter values(null),(1);
+
+select * from notinouter where a not in(select a from notininner);
+
 --
 -- Check we behave sanely in corner case of empty SELECT list (bug #8648)
 --


Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's

2019-03-01 Thread Tom Lane
James Coleman  writes:
> [ saop_is_not_null-v10.patch ]

I went through this again, and this time (after some more rewriting
of the comments) I satisfied myself that the logic is correct.
Hence, pushed.

I stripped down the regression test cases somewhat.  Those were
good for development, but they were about doubling the runtime of
test_predtest.sql, which seemed excessive for testing one feature.

I also tweaked it to recognize the case where we can prove the
array, rather than the scalar, to be null.  I'm not sure how useful
that is in practice, but it seemed weird that we'd exploit that
only if we can also prove the scalar to be null.

> I've implemented this fix as suggested. The added argument I've
> initially called "allow_false"; I don't love the name, but it is
> directly what it does. I'd appreciate other suggestions (if you prefer
> it change).

I was tempted to rename it to "weak", but decided that that might be
overloading that term too much in this module.  Left it as-is.

> Ideally I think this case would be handled elsewhere in the optimizer
> by directly replacing the saop and a constant NULL array with NULL,
> but this isn't the patch to do that, and at any rate I'd have to do a
> bit more investigation to understand how to do such (if it's easily
> possible).

Take a look at the ScalarArrayOp case in eval_const_expressions.
Right now it's only smart about the all-inputs-constant case.
I'm not really convinced it's worth spending cycles on the constant-
null-array case, but that'd be where to do it if we want to do it
in a general way.  (I think that what I added to clause_is_strict_for
is far cheaper, because while it's the same test, it's in a place
that we won't hit during most queries.)
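
If someone did want to teach eval_const_expressions about it, the shape
would presumably be something like the following untested sketch (saop
being the ScalarArrayOpExpr under simplification); it relies on the fact
that the executor returns NULL whenever the array input is NULL:

Node   *arrayarg = (Node *) lsecond(saop->args);

if (arrayarg && IsA(arrayarg, Const) &&
    ((Const *) arrayarg)->constisnull)
    return (Node *) makeNullConst(BOOLOID, -1, InvalidOid);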

> I also updated the tests with a new helper function "opaque_array" to
> make it easy to test cases where the planner can't inspect the array.

Yeah, that's a win.  I used that in most of the tests that I left in
place, but I kept a couple with long arrays just so we'd have code
coverage of the parts of clause_is_strict_for that need to detect
array size.

regards, tom lane



Re: Infinity vs Error for division by zero

2019-03-01 Thread Tom Lane
Chapman Flack  writes:
> I wanted to try this out a little before assuming it would work,
> and there seems to be no trouble creating a trivial domain over
> float8 (say, CREATE DOMAIN ieeedouble AS float8), and then creating
> operators whose operand types are the domain type.

While you can do that to some extent, you don't have a lot of control
over when the parser will use your operator --- basically, it'll only
do so if both inputs are exact type matches.  Maybe that'd be enough
but I think it'd be fragile to use.  (See the "Type Conversion"
chapter in the manual for the gory details, and note that domains
get smashed to their base types mighty readily.)

Using custom operator names would work better/more reliably.

regards, tom lane



Re: NOT IN subquery optimization

2019-03-01 Thread Tom Lane
David Rowley  writes:
> On Sat, 2 Mar 2019 at 05:44, Tom Lane  wrote:
>> I'm not sure if the second one is actually a semantics bug or just a
>> misoptimization?  But yeah, +1 for putting in some simple tests for
>> corner cases right now.  Anyone want to propose a specific patch?

> The second is just reducing the planner's flexibility to produce a
> good plan.  The first is a bug. Proposed regression test attached.

LGTM, pushed.

regards, tom lane



Re: NOT IN subquery optimization

2019-03-01 Thread Li, Zheng
Thanks all for the feedbacks! I'm working on a refined patch.

Although adding "or var is NULL" to the anti join condition forces the planner 
to choose nested loop anti join, it is always faster compared to the original 
plan. In order to enable the transformation from NOT IN to anti join when the 
inner/outer is nullable, we have to add some NULL test to the join condition.

We could make anti join t1, t2 on (t1.x = t2.y or t2.y IS NULL) eligible for 
hash join, but that would require changes to allow this special join qual for 
hash join as well as changes in the hash join executor to handle NULLs 
accordingly for this case.

Another option of transformation is to add "is not false" on top of the join 
condition.

Regards,
Zheng
On 3/1/19, 5:28 PM, "David Rowley"  wrote:

On Sat, 2 Mar 2019 at 05:44, Tom Lane  wrote:
>
> Andres Freund  writes:
> > I've not checked, but could we please make sure these cases are covered
> > in the regression tests today with a single liner?
>
> I'm not sure if the second one is actually a semantics bug or just a
> misoptimization?  But yeah, +1 for putting in some simple tests for
> corner cases right now.  Anyone want to propose a specific patch?

The second is just reducing the planner's flexibility to produce a
good plan.  The first is a bug. Proposed regression test attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Online verification of checksums

2019-03-01 Thread Robert Haas
On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
 wrote:
> I have added a retry for this as well now, without a pg_sleep() as well.
> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine?

Maybe I'm confused here, but catching 80% of torn pages doesn't sound
robust at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


