Re: pg_dump multi VALUES INSERT

2019-03-01 Thread Fabien COELHO



Hello David & Surafel,


I think this can be marked as ready for committer now, but I'll defer
to Fabien to see if he has any other comments.


Patch v16 applies and compiles cleanly, local and global "make check" 
are ok. Doc build is ok.


I did some manual testing with limit cases which did work. Good.

Although I'm all in favor of checking the int associated with the option, I 
do not think that it warrants three checks and messages. I would suggest 
factoring them all into just one check and one (terse) message.


Option "--help" line: number of row*s* ?

About the output: I'd suggest indenting it, one line per row, something like:

  INSERT INTO foo VALUES
(..., ..., ..., ...),
(..., ..., ..., ...),
(..., ..., ..., ...);

so as to avoid very very very very very very very very very very very very 
very very very very long lines in the output.


I'd suggest adding test tables with (1) no rows and (2) no columns but a
few rows, exercised with multiple --table options.
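
A minimal sketch of such corner-case tables (the --rows-per-insert option
name used here is an assumption; the patch may spell it differently):

  -- a table with columns but no rows
  CREATE TABLE ins_no_rows (a int, b text);

  -- a table with no columns but a few rows
  CREATE TABLE ins_no_cols ();
  INSERT INTO ins_no_cols DEFAULT VALUES;
  INSERT INTO ins_no_cols DEFAULT VALUES;
  INSERT INTO ins_no_cols DEFAULT VALUES;

  -- then dump both in one run, e.g.:
  --   pg_dump --rows-per-insert=2 --table=ins_no_rows --table=ins_no_cols dbname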


--
Fabien.



Re: [HACKERS] Incomplete startup packet errors

2019-03-01 Thread Tom Lane
Andrew Dunstan  writes:
> On 3/1/19 6:49 PM, Tom Lane wrote:
>> No patch referenced, but I assume you mean only for the
>> zero-bytes-received case, right?  No objection if so.

> Patch proposed by Christoph Berg is here:
> https://www.postgresql.org/message-id/20190228151336.GB7550%40msg.df7cb.de

Meh.  That doesn't silence only the zero-bytes case, and I'm also
rather afraid of the fact that it's changing COMMERROR to something
else.  I wonder whether (if client_min_messages <= DEBUG1) it could
result in trying to send the error message to the already-lost
connection.  It might be that that can't happen, but I think a fair
amount of rather subtle (and breakable) analysis may be needed.

regards, tom lane



Re: [HACKERS] Incomplete startup packet errors

2019-03-01 Thread Andrew Dunstan


On 3/1/19 6:49 PM, Tom Lane wrote:
> Andrew Dunstan  writes:
>> So I propose shortly to commit this patch unconditionally demoting the
>> message to DEBUG1.
> No patch referenced, but I assume you mean only for the
> zero-bytes-received case, right?  No objection if so.
>
>   


Patch proposed by Christoph Berg is here:


https://www.postgresql.org/message-id/20190228151336.GB7550%40msg.df7cb.de


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-03-01 Thread Peter Geoghegan
On Fri, Mar 1, 2019 at 5:00 PM Peter Geoghegan  wrote:
> I favor keeping the test, but having it throw an
> ERRCODE_INDEX_CORRUPTED error, just like _bt_pagedel() does already. A
> comment could point out that the test is historical/defensive, and
> probably isn't actually necessary. What do you think of that idea?

Actually, while 9.4 did indeed start treating "internal + half dead"
pages as corrupt, it didn't exactly remove the *concept* of a half
dead internal page. I think that the cross check (the one referenced
by comments above the corresponding leaf/_bt_mark_page_halfdead() call
to _bt_is_page_halfdead()) might have problems in the event of an
interrupted *multi-level* page deletion. I wonder, is there a subtle
bug here that bugfix commit 8da31837803 didn't quite manage to
prevent? (This commit added both of the _bt_is_page_halfdead()
checks.)

(Thinks some more...)

Actually, I think that bugfix commit 8da31837803 works despite
possible "logically half dead internal pages", because in the event of
such an internal page the sibling would actually have to be the
*cousin* of the original parent (half dead/leaf page parent), not the
"true sibling" (otherwise, cousin's multi-level page deletion should
never have taken place). I think that we'll end up doing the right
thing with the downlinks in the grandparent page, despite there being
an interrupted multi-level deletion in the cousin's subtree. Since
cousin *atomically* removed its downlink in our shared *grandparent*
(not its parent) at the same time some leaf page was initially marked
half-dead, everything works out.

Page deletion is painfully complicated. Seems wise to keep the
internal page test, out of sheer paranoia, while making it an error as
suggested earlier. I will definitely want to think about it some more,
though.

-- 
Peter Geoghegan



Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-03-01 Thread Peter Geoghegan
On Fri, Mar 1, 2019 at 4:41 PM Tom Lane  wrote:
> > FWIW, I notice that the logic that appears after the
> > _bt_lock_branch_parent() call to _bt_getstackbuf() anticipates that it
> > must defend against interrupted splits in at least the
> > grandparent-of-leaf page, and maybe even the parent, so it's probably
> > not unworkable to not finish the split:
>
> -ETOOMANYNEGATIVES ... can't quite parse your point here?

Sorry.   :-)

My point was that it actually seems feasible to not do the split,
making the quoted paragraph from nbtree README correct as-is. But,
since we're happy to continue to finish the occasional interrupted
internal page split within VACUUM anyway, that isn't an important
point.

> I think that code is there to deal with the possibility of finding
> an old half-dead page.  Don't know that it's safe to remove it yet.

I don't know that it is either. My first instinct is to assume that
it's fine to remove the code, since, as I said, we're treating
internal pages that are half-dead as being ipso facto corrupt -- we'll
throw an error before too long anyway. However, "internal + half dead"
was a valid state for an nbtree  page prior to 9.4, and we make no
distinction about that (versioning nbtree indexes to deal with
cross-version incompatibilities came only in Postgres v11). Trying to
analyze whether or not it's truly safe to just not do this seems very
difficult, and I don't think that it's performance critical. This is a
problem only because it's distracting and confusing.

I favor keeping the test, but having it throw an
ERRCODE_INDEX_CORRUPTED error, just like _bt_pagedel() does already. A
comment could point out that the test is historical/defensive, and
probably isn't actually necessary. What do you think of that idea?

-- 
Peter Geoghegan



Re: NOT IN subquery optimization

2019-03-01 Thread Tom Lane
David Rowley  writes:
> I think you're fighting a losing battle here with adding OR quals to
> the join condition.

Yeah --- that has a nontrivial risk of making things significantly worse,
which makes it a hard sell.  I think the most reasonable bet here is
simply to not perform the transformation if we can't prove the inner side
NOT NULL.  That's going to catch most of the useful cases anyway IMO.

regards, tom lane



Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-03-01 Thread Tom Lane
Peter Geoghegan  writes:
> _bt_lock_branch_parent() is used by VACUUM during page deletion, and
> calls _bt_getstackbuf(), which always finishes incomplete page splits
> for the parent page that it exclusive locks and returns. ISTM that
> this may be problematic, since it contradicts the general rule that
> VACUUM isn't supposed to finish incomplete page splits. According to
> the nbtree README:

> "It would seem natural to add the missing downlinks in VACUUM, but since
> inserting a downlink might require splitting a page, it might fail if you
> run out of disk space.  That would be bad during VACUUM - the reason for
> running VACUUM in the first place might be that you run out of disk space,
> and now VACUUM won't finish because you're out of disk space.  In contrast,
> an insertion can require enlarging the physical file anyway."

> I'm inclined to note this as an exception in the nbtree README, and
> leave it at that. Interrupted internal page splits are probably very
> rare in practice, so the operational risk of running out of disk space
> like this is minimal.

Also, if your WAL is on the same filesystem as the data, the whole
thing is pretty much moot anyway since VACUUM is surely going to add
WAL output.  I concur with not sweating over this.

> FWIW, I notice that the logic that appears after the
> _bt_lock_branch_parent() call to _bt_getstackbuf() anticipates that it
> must defend against interrupted splits in at least the
> grandparent-of-leaf page, and maybe even the parent, so it's probably
> not unworkable to not finish the split:

-ETOOMANYNEGATIVES ... can't quite parse your point here?

> I thought that internal pages were never half-dead after Postgres 9.4.
> If that happens, then the check within _bt_pagedel() will throw an
> ERRCODE_INDEX_CORRUPTED error, and tell the DBA to REINDEX. Shouldn't
> this internal level _bt_is_page_halfdead() check contain a "can't
> happen" error, or even a simple assertion?

I think that code is there to deal with the possibility of finding
an old half-dead page.  Don't know that it's safe to remove it yet.

regards, tom lane



Re: Tighten error control for OpenTransientFile/CloseTransientFile

2019-03-01 Thread Michael Paquier
On Fri, Mar 01, 2019 at 05:05:54PM -0500, Joe Conway wrote:
> Seems like it would be better to modify the arguments to
> CloseTransientFile() to include the filename being closed, errorlevel,
> and fail_on_error or something similar. Then all the repeated ereport
> stanzas could be eliminated.

Sure.  Now some code paths close file descriptors without having the
file name at hand, which would mean that we'd need to pass NULL as an
argument in those cases.  That's not really elegant in my opinion.  And
having a consistent mapping with the system's close() doesn't seem bad
to me either.
--
Michael




GSoC 2019

2019-03-01 Thread Sumukha Pk
Hello all!

I am Sumukha PK, a student of NITK. I am interested in the WAL-G backup tool. I
haven't been able to catch hold of anyone through the IRC channels, so I need
someone to point me to appropriate resources so that I can be introduced to it.
I am proficient in Golang and would be interested in working on this project.

Thanks



Re: NOT IN subquery optimization

2019-03-01 Thread David Rowley
On Sat, 2 Mar 2019 at 12:39, Li, Zheng  wrote:
> However, if s.a is nullable, we would do this transformation:
> select count(*) from big b where not exists(select 1 from small s
> where s.a = b.a or s.a is null);

I understand you're keen to make this work, but you're assuming again
that forcing the planner into a nested loop plan is going to be a win
over the current behaviour. It may well be in some cases, but it's
very simple to show cases where it's a significant regression.

Using the same tables from earlier, and again with master:

alter table small alter column a drop not null;
select * from big where a not in(select a from small);
Time: 430.283 ms

Here's what you're proposing:

select * from big b where not exists(select 1 from small s where s.a =
b.a or s.a is null);
Time: 37419.646 ms (00:37.420)

about 80 times slower. Making "small" a little less small would likely
see that gap grow even further.

I think you're fighting a losing battle here with adding OR quals to
the join condition. This transformation happens so early in planning
that you really can't cost it out either.  I think the only way that
could be made to work satisfactorily would be with some execution
level support for it.  Short of that, you're left with just adding
checks that either side of the join cannot produce NULL values...
That's what I've proposed in [1].

[1] 
https://www.postgresql.org/message-id/CAKJS1f_OA5VeZx8A8H8mkj3uqEgOtmHBGCUA6%2BxqgmUJ6JQURw%40mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



VACUUM can finish an interrupted nbtree page split -- is that okay?

2019-03-01 Thread Peter Geoghegan
_bt_lock_branch_parent() is used by VACUUM during page deletion, and
calls _bt_getstackbuf(), which always finishes incomplete page splits
for the parent page that it exclusive locks and returns. ISTM that
this may be problematic, since it contradicts the general rule that
VACUUM isn't supposed to finish incomplete page splits. According to
the nbtree README:

"It would seem natural to add the missing downlinks in VACUUM, but since
inserting a downlink might require splitting a page, it might fail if you
run out of disk space.  That would be bad during VACUUM - the reason for
running VACUUM in the first place might be that you run out of disk space,
and now VACUUM won't finish because you're out of disk space.  In contrast,
an insertion can require enlarging the physical file anyway."

I'm inclined to note this as an exception in the nbtree README, and
leave it at that. Interrupted internal page splits are probably very
rare in practice, so the operational risk of running out of disk space
like this is minimal.

FWIW, I notice that the logic that appears after the
_bt_lock_branch_parent() call to _bt_getstackbuf() anticipates that it
must defend against interrupted splits in at least the
grandparent-of-leaf page, and maybe even the parent, so it's probably
not unworkable to not finish the split:

/*
 * If the target is the rightmost child of its parent, then we can't
 * delete, unless it's also the only child.
 */
if (poffset >= maxoff)
{
/* It's rightmost child... */
if (poffset == P_FIRSTDATAKEY(opaque))
{
/*
 * It's only child, so safe if parent would itself be removable.
 * We have to check the parent itself, and then recurse to test
 * the conditions at the parent's parent.
 */
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) ||
P_INCOMPLETE_SPLIT(opaque))
{
_bt_relbuf(rel, pbuf);
return false;
}

Separately, I noticed another minor issue that appears a few lines
further down still:

/*
 * Perform the same check on this internal level that
 * _bt_mark_page_halfdead performed on the leaf level.
 */
if (_bt_is_page_halfdead(rel, *rightsib))
{
elog(DEBUG1, "could not delete page %u because its
right sibling %u is half-dead",
 parent, *rightsib);
return false;
}

I thought that internal pages were never half-dead after Postgres 9.4.
If that happens, then the check within _bt_pagedel() will throw an
ERRCODE_INDEX_CORRUPTED error, and tell the DBA to REINDEX. Shouldn't
this internal level _bt_is_page_halfdead() check contain a "can't
happen" error, or even a simple assertion?

-- 
Peter Geoghegan



Re: [HACKERS] Incomplete startup packet errors

2019-03-01 Thread Tom Lane
Andrew Dunstan  writes:
> So I propose shortly to commit this patch unconditionally demoting the
> message to DEBUG1.

No patch referenced, but I assume you mean only for the
zero-bytes-received case, right?  No objection if so.

regards, tom lane



Re: Infinity vs Error for division by zero

2019-03-01 Thread Tom Lane
Chapman Flack  writes:
> On 03/01/19 17:34, Tom Lane wrote:
>> Using custom operator names would work better/more reliably.

> Or a new base type (LIKE float8) rather than a domain?

Yeah, it'd be more work but you would have control over the
coercion rules.

regards, tom lane



Re: NOT IN subquery optimization

2019-03-01 Thread Li, Zheng
The current transformation would not add "or s.a is NULL" in the example 
provided, since s.a is non-nullable. So you would be comparing these two cases 
in terms of the transformation:

select count(*) from big b where not exists(select 1 from small s
where s.a = b.a);
Time: 51.416 ms

select count(*) from big b where a not in (select a from small);
Time: 890.088 ms

But if s.a is nullable, yes, you have proved my previous statement false... 
I should have said "almost always".
However, if s.a is nullable, we would do this transformation:
select count(*) from big b where not exists(select 1 from small s
where s.a = b.a or s.a is null);

It's possible to stop the nested loop join early during execution once we find 
an inner NULL entry, because then every outer tuple is going to evaluate to true 
on the join condition.

Zheng

On 3/1/19, 6:17 PM, "David Rowley"  wrote:

On Sat, 2 Mar 2019 at 12:13, Tom Lane  wrote:
>
> "Li, Zheng"  writes:
> > Although adding "or var is NULL" to the anti join condition forces the 
planner to choose nested loop anti join, it is always faster compared to the 
original plan.
>
> TBH, I am *really* skeptical of sweeping claims like that.  The existing
> code will typically produce a hashed-subplan plan, which ought not be
> that awful as long as the subquery result doesn't blow out memory.
> It certainly is going to beat a naive nested loop.

It's pretty easy to show the claim is false using master and NOT EXISTS.

create table small(a int not null);
create table big (a int not null);
insert into small select generate_Series(1,1000);
insert into big select x%1000+1 from generate_Series(1,100) x;

select count(*) from big b where not exists(select 1 from small s
where s.a = b.a);
Time: 178.575 ms

select count(*) from big b where not exists(select 1 from small s
where s.a = b.a or s.a is null);
Time: 38049.969 ms (00:38.050)


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Infinity vs Error for division by zero

2019-03-01 Thread Chapman Flack
On 03/01/19 17:34, Tom Lane wrote:

> but I think it'd be fragile to use.  (See the "Type Conversion"
> chapter in the manual for the gory details, and note that domains
> get smashed to their base types mighty readily.)
> 
> Using custom operator names would work better/more reliably.

Or a new base type (LIKE float8) rather than a domain?

Regards,
-Chap



Re: [HACKERS] Incomplete startup packet errors

2019-03-01 Thread Andrew Dunstan



On 2/28/19 10:13 AM, Christoph Berg wrote:
> Re: Magnus Hagander 2016-04-13 
> 
>> It's fairly common to see a lot of "Incomplete startup packet" in the
>> logfiles caused by monitoring or healthcheck connections.
> I've also seen it caused by port scanning.
>>>> Yes, definitely. Question there might be if that's actually a case when
>>>> we *want* that logging?
>>> I should think someone might.  But I doubt we want to introduce another
>>> GUC for this.  Would it be okay to downgrade the message to DEBUG1 if
>>> zero bytes were received?
>>>
>>>
>> Yeah, that was my suggestion - I think that's a reasonable compromise.  And
>> yes, I agree that a separate GUC for it would be a huge overkill.
> There have been numerous complaints about that log message, and the
> usual reply is always something like what Pavel said recently:
>
> "It is garbage. Usually it means nothing, but better to work live
> without this garbage." [1]
>
> [1] 
> https://www.postgresql.org/message-id/CAFj8pRDtwsxj63%3DLaWSwA8u7NrU9k9%2BdJtz2gB_0f4SxCM1sQA%40mail.gmail.com
>
> Let's get rid of it.



Right. This has annoyed me and a great many other people for years. I
think Robert Haas' argument 3 years ago (!) was on point, and disposes
of suggestions to keep it:


3. The right way to detect attacks is through OS-level monitoring or
firewall-level monitoring, and nothing we do in PG is going to come
close to the same value.


So I propose shortly to commit this patch unconditionally demoting the
message to DEBUG1.


cheers


andrew


-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: NOT IN subquery optimization

2019-03-01 Thread David Rowley
On Sat, 2 Mar 2019 at 12:13, Tom Lane  wrote:
>
> "Li, Zheng"  writes:
> > Although adding "or var is NULL" to the anti join condition forces the 
> > planner to choose nested loop anti join, it is always faster compared to 
> > the original plan.
>
> TBH, I am *really* skeptical of sweeping claims like that.  The existing
> code will typically produce a hashed-subplan plan, which ought not be
> that awful as long as the subquery result doesn't blow out memory.
> It certainly is going to beat a naive nested loop.

It's pretty easy to show the claim is false using master and NOT EXISTS.

create table small(a int not null);
create table big (a int not null);
insert into small select generate_Series(1,1000);
insert into big select x%1000+1 from generate_Series(1,100) x;

select count(*) from big b where not exists(select 1 from small s
where s.a = b.a);
Time: 178.575 ms

select count(*) from big b where not exists(select 1 from small s
where s.a = b.a or s.a is null);
Time: 38049.969 ms (00:38.050)


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: NOT IN subquery optimization

2019-03-01 Thread Tom Lane
"Li, Zheng"  writes:
> Although adding "or var is NULL" to the anti join condition forces the 
> planner to choose nested loop anti join, it is always faster compared to the 
> original plan.

TBH, I am *really* skeptical of sweeping claims like that.  The existing
code will typically produce a hashed-subplan plan, which ought not be
that awful as long as the subquery result doesn't blow out memory.
It certainly is going to beat a naive nested loop.

regards, tom lane



Re: Why don't we have a small reserved OID range for patch revisions?

2019-03-01 Thread Tom Lane
Peter Geoghegan  writes:
> On Fri, Mar 1, 2019 at 11:56 AM Robert Haas  wrote:
>> It would be neat if there were a tool you could run to somehow tell
>> you whether catversion needs to be changed for a given patch.

> That seems infeasible because of stored rules. A lot of things bleed
> into that. We could certainly do better at documenting this on the
> "committing checklist" page, though.

A first approximation to that is "did you touch readfuncs.c", though
that rule will give a false positive if you only changed Plan nodes.

regards, tom lane



Re: Online verification of checksums

2019-03-01 Thread Robert Haas
On Tue, Sep 18, 2018 at 10:37 AM Michael Banck
 wrote:
> I have added a retry for this as well now, without a pg_sleep() as well.
> This catches around 80% of the half-reads, but a few slip through. At
> that point we bail out with exit(1), and the user can try again, which I
> think is fine?

Maybe I'm confused here, but catching 80% of torn pages doesn't sound
robust at all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: NOT IN subquery optimization

2019-03-01 Thread Li, Zheng
Thanks all for the feedback! I'm working on a refined patch.

Although adding "or var is NULL" to the anti join condition forces the planner 
to choose nested loop anti join, it is always faster compared to the original 
plan. In order to enable the transformation from NOT IN to anti join when the 
inner/outer is nullable, we have to add some NULL test to the join condition.

We could make an anti join of t1, t2 on (t1.x = t2.y or t2.y IS NULL) eligible 
for hash join, but it would require allowing this special kind of join qual for 
hash join, as well as changes in the hash join executor to handle NULLs 
accordingly for this case.

Another transformation option is to add "is not false" on top of the join 
condition.
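
For illustration, using the big/small tables from earlier in the thread, that 
second form would presumably look something like this (a sketch of my reading, 
not taken from the patch):

select count(*) from big b where not exists(select 1 from small s
where (s.a = b.a) is not false);

Here "(s.a = b.a) is not false" is true when the comparison is true or NULL, so 
NULL-producing rows still suppress the outer row, which matches NOT IN 
semantics.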

Regards,
Zheng
On 3/1/19, 5:28 PM, "David Rowley"  wrote:

On Sat, 2 Mar 2019 at 05:44, Tom Lane  wrote:
>
> Andres Freund  writes:
> > I've not checked, but could we please make sure these cases are covered
> > in the regression tests today with a single liner?
>
> I'm not sure if the second one is actually a semantics bug or just a
> misoptimization?  But yeah, +1 for putting in some simple tests for
> corner cases right now.  Anyone want to propose a specific patch?

The second is just reducing the planner's flexibility to produce a
good plan.  The first is a bug. Proposed regression test attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: NOT IN subquery optimization

2019-03-01 Thread Tom Lane
David Rowley  writes:
> On Sat, 2 Mar 2019 at 05:44, Tom Lane  wrote:
>> I'm not sure if the second one is actually a semantics bug or just a
>> misoptimization?  But yeah, +1 for putting in some simple tests for
>> corner cases right now.  Anyone want to propose a specific patch?

> The second is just reducing the planner's flexibility to produce a
> good plan.  The first is a bug. Proposed regression test attached.

LGTM, pushed.

regards, tom lane



Re: Infinity vs Error for division by zero

2019-03-01 Thread Tom Lane
Chapman Flack  writes:
> I wanted to try this out a little before assuming it would work,
> and there seems to be no trouble creating a trivial domain over
> float8 (say, CREATE DOMAIN ieeedouble AS float8), and then creating
> operators whose operand types are the domain type.

While you can do that to some extent, you don't have a lot of control
over when the parser will use your operator --- basically, it'll only
do so if both inputs are exact type matches.  Maybe that'd be enough
but I think it'd be fragile to use.  (See the "Type Conversion"
chapter in the manual for the gory details, and note that domains
get smashed to their base types mighty readily.)

Using custom operator names would work better/more reliably.

regards, tom lane



Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's

2019-03-01 Thread Tom Lane
James Coleman  writes:
> [ saop_is_not_null-v10.patch ]

I went through this again, and this time (after some more rewriting
of the comments) I satisfied myself that the logic is correct.
Hence, pushed.

I stripped down the regression test cases somewhat.  Those were
good for development, but they were about doubling the runtime of
test_predtest.sql, which seemed excessive for testing one feature.

I also tweaked it to recognize the case where we can prove the
array, rather than the scalar, to be null.  I'm not sure how useful
that is in practice, but it seemed weird that we'd exploit that
only if we can also prove the scalar to be null.

> I've implemented this fix as suggested. The added argument I've
> initially called "allow_false"; I don't love the name, but it is
> directly what it does. I'd appreciate other suggestions (if you prefer
> it change).

I was tempted to rename it to "weak", but decided that that might be
overloading that term too much in this module.  Left it as-is.

> Ideally I think this case would be handled elsewhere in the optimizer
> by directly replacing the saop and a constant NULL array with NULL,
> but this isn't the patch to do that, and at any rate I'd have to do a
> bit more investigation to understand how to do such (if it's easily
> possible).

Take a look at the ScalarArrayOp case in eval_const_expressions.
Right now it's only smart about the all-inputs-constant case.
I'm not really convinced it's worth spending cycles on the constant-
null-array case, but that'd be where to do it if we want to do it
in a general way.  (I think that what I added to clause_is_strict_for
is far cheaper, because while it's the same test, it's in a place
that we won't hit during most queries.)
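
For concreteness, the constant-null-array case is an expression like

select 1 = any (null::int[]);   -- yields NULL regardless of the left-hand input

which eval_const_expressions could, in principle, fold to a constant NULL.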

> I also updated the tests with a new helper function "opaque_array" to
> make it easy to test cases where the planner can't inspect the array.

Yeah, that's a win.  I used that in most of the tests that I left in
place, but I kept a couple with long arrays just so we'd have code
coverage of the parts of clause_is_strict_for that need to detect
array size.

regards, tom lane



Re: NOT IN subquery optimization

2019-03-01 Thread David Rowley
On Sat, 2 Mar 2019 at 05:44, Tom Lane  wrote:
>
> Andres Freund  writes:
> > I've not checked, but could we please make sure these cases are covered
> > in the regression tests today with a single liner?
>
> I'm not sure if the second one is actually a semantics bug or just a
> misoptimization?  But yeah, +1 for putting in some simple tests for
> corner cases right now.  Anyone want to propose a specific patch?

The second is just reducing the planner's flexibility to produce a
good plan.  The first is a bug. Proposed regression test attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/test/regress/expected/subselect.out 
b/src/test/regress/expected/subselect.out
index cc3f5f3737..1b09b3e3fd 100644
--- a/src/test/regress/expected/subselect.out
+++ b/src/test/regress/expected/subselect.out
@@ -830,6 +830,20 @@ explain (verbose, costs off)
  One-Time Filter: ("*VALUES*".column1 = "*VALUES*".column1)
 (8 rows)
 
+--
+-- Check we don't filter NULL outer rows in a NOT IN where the subquery
+-- returns no rows.
+--
+create temp table notinouter (a int);
+create temp table notininner (a int not null);
+insert into notinouter values(null),(1);
+select * from notinouter where a not in(select a from notininner);
+ a 
+---
+  
+ 1
+(2 rows)
+
 --
 -- Check we behave sanely in corner case of empty SELECT list (bug #8648)
 --
diff --git a/src/test/regress/sql/subselect.sql 
b/src/test/regress/sql/subselect.sql
index 8bca1f5d55..48230d4671 100644
--- a/src/test/regress/sql/subselect.sql
+++ b/src/test/regress/sql/subselect.sql
@@ -466,6 +466,16 @@ explain (verbose, costs off)
   select x, x from
 (select (select random() where y=y) as x from (values(1),(2)) v(y)) ss;
 
+--
+-- Check we don't filter NULL outer rows in a NOT IN where the subquery
+-- returns no rows.
+--
+create temp table notinouter (a int);
+create temp table notininner (a int not null);
+insert into notinouter values(null),(1);
+
+select * from notinouter where a not in(select a from notininner);
+
 --
 -- Check we behave sanely in corner case of empty SELECT list (bug #8648)
 --


Re: Infinity vs Error for division by zero

2019-03-01 Thread Matt Pulver
On Fri, Mar 1, 2019 at 4:51 PM Chapman Flack  wrote:

> On 3/1/19 3:49 PM, Matt Pulver wrote:
>
> > In many applications, I would much rather see calculations carried out
> > via IEEE 754 all the way to the end, with nans and infs, which
> > provides much more useful diagnostic information than an exception that
> > doesn't return any rows at all. As Andres Freund pointed out, it is also
> > more expensive to do the intermediate checks. Just let IEEE 754 do its
> > thing! (More directed at the SQL standard than to PostgreSQL.)
>
> I wanted to try this out a little before assuming it would work,
> and there seems to be no trouble creating a trivial domain over
> float8 (say, CREATE DOMAIN ieeedouble AS float8), and then creating
> operators whose operand types are the domain type.
>
> So it seems an extension could easily do that, and supply happily
> inf-returning and NaN-returning versions of the operators and
> functions, and those will be used whenever operands have the domain
> type.
>
> It might even be useful and relatively elegant, while leaving the
> SQL-specified base types to have the SQL-specified behavior.
>

That would be very useful. I've been wanting this for years, and I'm sure
the data users I work with will appreciate it (even if they wouldn't
directly recognize this as the solution).

There are issues relating to ordering and aggregation that are perhaps
already transparent to you, but I'll mention them anyway for the record.
Conceptually, there would be different contexts of ordering:

   1. When writing mathematical functions, <, =, and > are all false when
   comparing to NaN (NaN != NaN is true.)
   2. In SQL when sorting or aggregating, NaN=NaN. Consider that there are
   2^53-2 different double precision representations of NaN at the bit level.
   Under the same floating point ordering logic used for finite numbers, when
   applied to inf and nan, we get the following ordering: -nan < -inf < (all
   finite numbers) < inf < nan. When the bit patterns are taken into
   consideration, an efficient sort algorithm can be implemented. (Forgive me
   for stating the obvious, but just mentioning this for whoever is going to
   take this on.)
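
For reference, PostgreSQL's existing float8 comparison semantics already match
case 2 for ordering purposes: NaN is treated as equal to itself and greater
than all non-NaN values, so it sorts and indexes consistently. A quick
illustration:

select 'NaN'::float8 = 'NaN'::float8;        -- true
select 'NaN'::float8 > 'Infinity'::float8;   -- true
select x from (values ('-Infinity'::float8), ('0'), ('Infinity'), ('NaN')) v(x)
order by x;                                  -- -Infinity, 0, Infinity, NaN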

I would be most interested to hear of and discuss any other unforeseen
complications or side-effects.

Best regards,
Matt


Re: Tighten error control for OpenTransientFile/CloseTransientFile

2019-03-01 Thread Joe Conway
On 2/28/19 9:33 PM, Michael Paquier wrote:
> Hi all,
> 
> Joe's message here has reminded me that we have lacked a lot of error
> handling around CloseTransientFile():
> https://www.postgresql.org/message-id/c49b69ec-e2f7-ff33-4f17-0eaa4f2ce...@joeconway.com
> 
> This has been mentioned by Alvaro a couple of months ago (cannot find
> the thread about that at quick glance), and I just forgot about it at
> that time.  Anyway, attached is a patch to do some cleanup for all
> that:
> - Switch OpenTransientFile to read-only where sufficient.
> - Add more error handling for CloseTransientFile
> A major take of this patch is to make sure that the new error messages
> generated have an elevel consistent with their neighbors.
> 
> Just on time for this last CF.  Thoughts?

Seems like it would be better to modify the arguments to
CloseTransientFile() to include the filename being closed, errorlevel,
and fail_on_error or something similar. Then all the repeated ereport
stanzas could be eliminated.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: Drop type "smgr"?

2019-03-01 Thread Thomas Munro
On Fri, Mar 1, 2019 at 9:11 PM Konstantin Knizhnik
 wrote:
> One more thing... From my point of view one of the drawbacks of Postgres
> is that it requires an underlying file system and is not able to work with
> raw partitions.
> It seems to me that bypassing the file system layer can significantly improve
> performance and give more possibilities for IO performance tuning.
> Certainly it will require a lot of changes in the Postgres storage layer, so
> this is not what I suggest to implement or even discuss right now.
> But it may be useful to keep it in mind in discussions concerning
> "generic storage manager".

Hmm.  Speculation-around-the-water-cooler-mode:  I think the arguments
for using raw partitions are approximately the same as the arguments
for using a big data file that holds many relations.  The three I can
think of are (1) the space is entirely preallocated, which *might*
have performance and safety advantages, (2) if you encrypt it, no one
can see the structure (database OIDs, relation OIDs and sizes) and (3)
it might allow pages to be moved from one relation to another without
copying or interleaved in interesting ways (think SQL Server parallel
btree creation that "stitches together" btrees produced by parallel
workers, or Oracle "clustered" tables where the pages of two tables
are physically interleaved in an order that works nicely when you join
those two tables, or perhaps schemes for moving data between
relations/partitions quickly).  On the other hand, to make that work
you have to own the problem of space allocation/management that we
currently leave to the authors of XFS et al, and those guys have
worked on that for *years and years* and they work really well.  If
you made all that work for big preallocated data files, then sure, you
could also make it work for raw partitions, but I'm not sure how much
performance advantage there is for that final step.  I suspect that a
major reason for Oracle to support raw block devices many years ago
was because back then there was no other way to escape from the page
cache.  Direct IO hasn't always been available or portable, and hasn't
always worked well.  That said, it does seem plausible that we could
do the separation of (1) block -> pathname/offset mappings and (2)
actual IO operations in a way that you could potentially write your
own pseudo-filesystem that stores a whole PostgreSQL cluster inside
big data files or raw partitions.  Luckily we don't need to tackle
such mountainous terrain to avoid the page cache, today.

-- 
Thomas Munro
https://enterprisedb.com



Re: Infinity vs Error for division by zero

2019-03-01 Thread Chapman Flack
On 3/1/19 3:49 PM, Matt Pulver wrote:

> In many applications, I would much rather see calculations carried out
> via IEEE 754 all the way to the end, with nans and infs, which
> provides much more useful diagnostic information than an exception that
> doesn't return any rows at all. As Andres Freund pointed out, it is also
> more expensive to do the intermediate checks. Just let IEEE 754 do its
> thing! (More directed at the SQL standard than to PostgreSQL.)

I wanted to try this out a little before assuming it would work,
and there seems to be no trouble creating a trivial domain over
float8 (say, CREATE DOMAIN ieeedouble AS float8), and then creating
operators whose operand types are the domain type.

So it seems an extension could easily do that, and supply happily
inf-returning and NaN-returning versions of the operators and
functions, and those will be used whenever operands have the domain
type.

It might even be useful and relatively elegant, while leaving the
SQL-specified base types to have the SQL-specified behavior.
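
For example, a minimal sketch of that idea (ignoring the sign of zero, and
noting that, per the type-resolution caveat discussed elsewhere in the thread,
the operator is only chosen when both operands are exactly of the domain type):

CREATE DOMAIN ieeedouble AS float8;

-- division that returns Inf/NaN instead of raising an error
CREATE FUNCTION ieeedouble_div(a ieeedouble, b ieeedouble) RETURNS ieeedouble AS $$
    SELECT (CASE
                WHEN b::float8 <> 0.0::float8 THEN a::float8 / b::float8
                WHEN a::float8 > 0.0::float8  THEN 'Infinity'::float8
                WHEN a::float8 < 0.0::float8  THEN '-Infinity'::float8
                ELSE 'NaN'::float8            -- covers 0/0 and NaN inputs
            END)::ieeedouble;
$$ LANGUAGE sql IMMUTABLE STRICT;

CREATE OPERATOR / (LEFTARG = ieeedouble, RIGHTARG = ieeedouble,
                   PROCEDURE = ieeedouble_div);

SELECT 1.0::ieeedouble / 0.0::ieeedouble;   -- Infinity, rather than an error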

-Chap



Re: Refactoring the checkpointer's fsync request queue

2019-03-01 Thread Robert Haas
On Fri, Mar 1, 2019 at 3:35 PM Shawn Debnath  wrote:
> On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote:
> > On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath  wrote:
> > > I disagree, at least with combining and retaining enums. Encoding all
> > > the possible request types with the current, planned and future SMGRs
> > > would cause a sheer explosion in the number of enum  values.
> >
> > How big of an explosion would it be?
>
> 4 enum values x # of smgrs; currently md, soon undo and slru so 12 in
> total. Any future smgr addition will expand this further.

I thought the idea was that each smgr might have a different set of
requests.  If they're all going to have the same set of requests then
I agree with you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-01 Thread Robert Haas
On Fri, Mar 1, 2019 at 3:52 PM Haribabu Kommi  wrote:
> The Cybertec proposed patches are doing the encryption at the instance
> level; AFAIK, the current discussion is also trying to reduce the scope of
> the encryption to the object level (tablespace, database or table) to avoid
> the encryption performance impact for databases and tables that don't need it.

The trick there is that it becomes difficult to figure out which keys
to use for certain things.  For example, you could say, well, this WAL
record is for a table that is encrypted with key 123, so let's use key
123 to encrypt the WAL record also.  So far, so good.  But then how do
you encrypt, say, a logical decoding spill file?  That could have data
in it mixed together from multiple relations, IIUC.  Or what do you do
about SLRUs or other global structures?  If you just exclude that
stuff from the scope of encryption, then you aren't helping the people
who want to Just Encrypt Everything.

Now that having been said I bet a lot of people would find it pretty
cool if we could make this work on a per-table basis.  And I'm not
opposed to that.  I just think it's really hard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: "WIP: Data at rest encryption" patch and, PostgreSQL 11-beta3

2019-03-01 Thread Robert Haas
On Fri, Sep 14, 2018 at 10:04 AM Antonin Houska  wrote:
> Toshi Harada  wrote:
> > Even if you apply "data-at-rest-encryption-wip-2018.07.25.patch" to the 
> > master branch,
> > the same patch error will occur.
>
> The version attached to this email should be applicable to the current
> branch. Sorry for the delay.

I have a few comments on this patch.  It doesn't seem useful to go
through and make really detailed review comments on everything at this
stage, because it's pretty far from being committable, at least IMHO.
However, I have a few general thoughts.

First of all, this is an impressive piece of work.  It's obvious even
from looking at this quickly that a lot of energy has gone into making
this work, and it touches a lot of stuff, not without reason.  For
example, it's quite impressive that this touches things like the
write-ahead log, logical decoding, and the stats collector, all of
which write to disk.

However, this is also quite invasive.  It changes a lot of code and it
doesn't do so in a very predictable way.  It's not like you went
through and replaced every call to write() with a call to
SpecialEncryptionMagicWrite().  Rather, there are new #ifdef
USE_ENCRYPTION blocks all over the code base.  I think it will be
important to look for ways to refactor this functionality to reduce
that sort of thing to the absolute minimum possible.  Encryption needs
to be something that the code needs to know about here and there, but
not everywhere.

Some of the changes are things that could perhaps be done as separate,
preparatory patches.  For instance, pgstat.c gets a heavy refactoring
to make it do I/O in bigger chunks.  There's no reason that you
couldn't separate some of those changes out, clean them up, and get
them committed separately.  It would be more efficient even as things
stand.  OK, well, there is one reason: we may be getting a
shared-memory stats collector soon, and if we do, then the need for
all of this will go away.  But the general point is valid: whatever is
a useful cleanup apart from this patch should be done separately from
this patch.

As far as possible, we need to try to do things in the same way in the
encrypted and non-encrypted cases.  For example, it's pretty hard to
feel good about code like this:

+ sz_hdr = sizeof(ReorderBufferDiskChange);
+ if (data_encrypted)
+#ifdef USE_ENCRYPTION
+ sz_hdr = TYPEALIGN(ENCRYPTION_BLOCK, sz_hdr);
+#else
+ ENCRYPTION_NOT_SUPPORTED_MSG;
+#endif /* USE_ENCRYPTION */

You won't be able to have much confidence that this code works both
with and without encryption unless you test it both ways, because the
file format is different in those two cases, and not just by being
encrypted.  That means that anyone who modifies this code in the
future has two ways of breaking it.  They can break the normal case,
or they can break the encrypted case.  If encryption were available as
a sort of service that everyone could use, then that would probably be
fine, but like I said above, I think something as invasive as this
currently is will lead to a lot of complaints.

The code needs a lot more documentation, not only SGML documentation
but also code comments and probably a README file someplace.  There is
a lot of discussion in the comments here of encryption tweaks, but
there's no general introduction to the topic (what's a tweak?) or --
and I think this is important -- what the idea behind the choice of
various tweaks in different places actually is.  There's probably some
motivating philosophy behind the way the tweaks are chosen that should
be explained someplace.  Think about it this way: if some hapless
developer, let's say me, wants to add a new module to PostgreSQL which
happens to write data to disk, that person needs an easy way to
understand what they need to do to make that new code play nice with
encryption.  Right now, the code for every existing module is a little
different (uggh) and there's this encryption tweak stuff about which
you have to do something intelligent, but there's nothing
that tells you how to be intelligent about it, at least not that I
see.  People are going to need something that looks more like a
formula that they can just copy, and a good introduction to the
principles in some appropriate place, too.

The interaction of this capability with certain tricks that PostgreSQL
plays needs some thought -- and documentation, at least developer
documentation, maybe user documentation.  One area is recovery.
Writing WAL relies on the fact that if you are in the midst of
rewriting a block and the power fails, bytes that are the same in both
the old and new block images will be undisturbed; encryption will make
that false.  How's that going to work?  Hint bits also rely on this
principle.  I thought there might be some interaction between this
work and wal_log_hints for this reason, but I see nothing of that sort
in the patch.  Full page writes seem related too; I don't know how
something like this can be safe without bo

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-01 Thread Haribabu Kommi
On Sat, Mar 2, 2019 at 7:27 AM Robert Haas  wrote:

> On Thu, Feb 7, 2019 at 3:28 AM Masahiko Sawada 
> wrote:
> > WAL encryption will follow as an additional feature.
>
> I don't think WAL encryption is an optional feature.  You can argue
> about whether it's useful to encrypt the disk files in the first place
> given that there's no privilege boundary between the OS user and the
> database, but a lot of people seem to think it is and maybe they're
> right.  However, who can justify encrypting only SOME of the disk
> files and not others?  I mean, maybe you could justify not encrypting
> the SLRU files on the grounds that they probably don't leak much in
> the way of interesting information, but the WAL files certainly do --
> your data is there, just as much as in the data files themselves.
>

+1

WAL encryption is not optional; the WAL must be encrypted.



> To be honest, I think there is a lot to like about the patches
> Cybertec has proposed.  Those patches don't have all of the fancy
> key-management stuff that you are proposing here, but maybe that
> stuff, if we want it, could be added, rather than starting over from
> scratch.  It seems to me that those patches get a lot of things right.
> In particular, it looked to me when I looked at them like they made a
> pretty determined effort to encrypt every byte that might go down to
> the disk.  It seems to me that if you want encryption, you want
> that.
>


The Cybertec proposed patches are doing the encryption at the instance
level; AFAIK, the current discussion is also trying to reduce the scope of
the encryption to the object level (tablespace, database or table) to avoid
the encryption performance impact for databases and tables that don't need it.

IMO, the entire performance of encryption depends on WAL encryption, whether
we choose instance level or object level encryption.

As WAL encryption is not optional, even if encryption is set at the object
level, the corresponding objects' WAL needs to be encrypted. This should be
done at WAL insertion, not at WAL write to disk, because some entries would
be encrypted and some not. Or maybe we should encrypt the entire WAL even if
only one object is set for encryption.


Regards,
Haribabu Kommi
Fujitsu Australia


Re: Infinity vs Error for division by zero

2019-03-01 Thread Matt Pulver
On Fri, Mar 1, 2019 at 12:59 PM Andrew Gierth 
wrote:

> > "Matt" == Matt Pulver  writes:
>
>  Matt> ERROR:  division by zero
>
>  Matt> Question: If Infinity and NaN are supported, then why throw an
>  Matt> exception here, instead of returning Infinity?
>
> Spec says so:
>
>   4) The dyadic arithmetic operators <plus sign>, <minus sign>,
>  <asterisk>, and <solidus> (+, -, *, and /, respectively) specify
>  addition, subtraction, multiplication, and division, respectively.
>  If the value of a divisor is zero, then an exception condition is
>  raised: data exception -- division by zero.


Thank you, that is what I was looking for. In case anyone else is looking
for source documentation on the standard, there is a link from
https://en.wikipedia.org/wiki/SQL:2003#Documentation_availability to a zip
file of the SQL 2003 draft http://www.wiscorp.com/sql_2003_standard.zip
where one can confirm this (page 242 of 5WD-02-Foundation-2003-09.pdf).


On Fri, Mar 1, 2019 at 2:26 PM David G. Johnston 
wrote:

> On Friday, March 1, 2019, Chapman Flack  wrote:
>
>>
>> But if someone wanted to write a user-defined division function or
>> operator that would return Inf for (anything > 0) / 0 and for
>> (anything < 0) / -0, and -Inf for (anything < 0) / 0 and for
>> (anything > 0) / -0, and NaN for (either zero) / (either zero), I think
>> that function or operator would be fully in keeping with IEEE 754.
>>
>
> Upon further reading you are correct - IEEE 754 has chosen to treat n/0
> differently for n=0 and n<>0 cases.  I'm sure they have their reasons but
> within the scope of this database, and the core arithmetic functions it
> provides, those distinctions don't seem meaningful and having to add
> query logic to deal with both cases would just be annoying.  I don't use,
> or have time for the distraction, to understand why such a decision was
> made and how it could be useful.  Going from an exception to NaN makes
> sense to me, going instead to infinity - outside of limit expressions which
> aren't applicable here - does not.
>
> For my part in the queries I have that encounter divide-by-zero I end up
> transforming the result to zero which is considerably easier to
> present/absorb along side other valid fractions in a table or chart.
>

In heavy financial/scientific calculations with tables of data, using inf
and nan is very useful, much more so than alternatives such as throwing an
exception (which row(s) included the error?), or replacing them with NULL
or 0. There are many intermediate values where using inf makes sense and
results in finite outcomes at the appropriate limit: atan(1.0/0)=pi/2,
erf(1.0/0)=1, exp(-1.0/0)=0, etc.

In contrast, nan represents a mathematically indeterminate form, in which
the appropriate limit could not be ascertained. E.g. 0.0/0, inf-inf,
0.0*inf, etc. In many applications, I would much rather see calculations
carried out via IEEE 754 all the way to the end, with nans and infs, which
provides much more useful diagnostic information than an exception that
doesn't return any rows at all. As Andres Freund pointed out, it is also
more expensive to do the intermediate checks. Just let IEEE 754 do its
thing! (More directed at the SQL standard than to PostgreSQL.)

Best regards,
Matt


Re: Refactoring the checkpointer's fsync request queue

2019-03-01 Thread Thomas Munro
On Sat, Mar 2, 2019 at 9:35 AM Shawn Debnath  wrote:
> On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote:
> > On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath  wrote:
> > > I disagree, at least with combining and retaining enums. Encoding all
> > > the possible request types with the current, planned and future SMGRs
> > > would cause a sheer explosion in the number of enum  values.
> >
> > How big of an explosion would it be?
>
> 4 enum values x # of smgrs; currently md, soon undo and slru so 12 in
> total. Any future smgr addition will expand this further.

It's not so much the "explosion" that bothers me.  I think we should
have a distinct sync requester enum, because we need a way to index
into the table of callbacks.  How exactly you pack the two enums into
compact space seems like a separate question; doing it with two words
would obviously be wasteful, but it should be possible to stuff them into
(say) a single uint8_t, uint16_t or whatever will pack nicely in the
request struct and allow the full range of request types (4?) + the
full range of sync requesters (which we propose to expand to 3 in the
foreseeable future).  Now perhaps the single enum idea was going to
involve explicit values that encode the two values SYNC_REQ_CANCEL_MD
= 0x1 | (0x04 << 4) so you could still extract the requester part, but
that's just the same thing with uglier code.
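
Something along these lines, perhaps -- a sketch only, with made-up enum
values and names rather than anything from the patch: the low four bits carry
the request type and the high four bits the requester, so a queue entry stays
one byte while the rest of the code keeps working with two separate enums.

#include <assert.h>
#include <stdint.h>

typedef enum syncrequesttype
{
    SYNC_REQUEST = 0,
    SYNC_UNLINK_REQUEST,
    SYNC_FORGET_REQUEST,
    SYNC_FILTER_REQUEST
} syncrequesttype;

typedef enum syncrequestowner
{
    SYNC_MD = 0,                /* md smgr */
    SYNC_UNDO,                  /* hypothetical future requester */
    SYNC_SLRU                   /* hypothetical future requester */
} syncrequestowner;

/* pack request type (low 4 bits) and owner (high 4 bits) into one byte */
static inline uint8_t
sync_request_pack(syncrequesttype type, syncrequestowner owner)
{
    assert((int) type < 16 && (int) owner < 16);
    return (uint8_t) (((unsigned) owner << 4) | (unsigned) type);
}

static inline syncrequesttype
sync_request_type(uint8_t packed)
{
    return (syncrequesttype) (packed & 0x0F);
}

static inline syncrequestowner
sync_request_owner(uint8_t packed)
{
    return (syncrequestowner) (packed >> 4);
}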

-- 
Thomas Munro
https://enterprisedb.com



Re: Why don't we have a small reserved OID range for patch revisions?

2019-03-01 Thread Peter Geoghegan
On Fri, Mar 1, 2019 at 11:56 AM Robert Haas  wrote:
> It would be neat if there were a tool you could run to somehow tell
> you whether catversion needs to be changed for a given patch.

That seems infeasible because of stored rules. A lot of things bleed
into that. We could certainly do better at documenting this on the
"committing checklist" page, though.

> Indeed, Simon got complaints a number of years ago (2010, it looks
> like) when he had the temerity to change the magic number to some
> other unrelated value instead of just incrementing it by one.

Off with his head!

> Although I think that the criticism was to a certain extent
> well-founded -- why deviate from previous practice? -- there is at the
> same time something a little crazy about somebody getting excited
> about the particular value that has been chosen for a number that is
> described in the very name of the constant as a MAGIC number.  And
> especially because there is absolutely zip in the way of code comments
> or a README that explain to you how to do it "right."

I have learned to avoid ambiguity more than anything else, because
ambiguity causes patches to flounder indefinitely, whereas it's
usually not that hard to fix something that's broken. I agree -
anything that adds ambiguity rather than taking it away is a big
problem.

> Well, perhaps I'm proposing some additional code, but I don't think of
> that as making the mechanism more complicated.  I want to make it
> simpler for patch submitters and reviewers and committers to not make
> mistakes that they have to run around and fix.

Right. So do I. I just don't think that it's that bad to ask the final
committer to do something once, rather than getting everyone else
(including committers) to do it multiple times. If we can avoid even
this burden, and totally centralize the management of the OID space,
then so much the better.

> If there are fewer
> kinds of things that qualify as mistakes, as in Tom's latest proposal,
> then we are moving in the right direction IMO regardless of anything
> else.

I'm glad that we now have a plan that is a clear step forward.

> I think Peter and I are more agreeing than we are at the opposite ends
> of a spectrum, but more importantly, I think it is worth having a
> discussion first about what people like and dislike, and what goals
> they have, and then only if necessary, counting the votes afterwards.

I agree that that's totally worthwhile.

> I don't like having the feeling that because I have a different view
> of something and want to write an email about that, I am somehow an
> impediment to progress.  I think if we reduce discussions to
> you're-for-it-or-your-against-it, that's not that helpful.

That was not my intention. The way that you brought the issue of the
difficulty of being a contributor in general into it was unhelpful,
though. It didn't seem useful or fair to link Tom's position to a big,
well known controversy.

We now have a solution that everyone is happy with, or can at least
live with, which suggests to me that Tom wasn't being intransigent or
insensitive to the concerns of contributors.

-- 
Peter Geoghegan



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-01 Thread Robert Haas
On Thu, Feb 7, 2019 at 3:28 AM Masahiko Sawada  wrote:
> WAL encryption will follow as an additional feature.

I don't think WAL encryption is an optional feature.  You can argue
about whether it's useful to encrypt the disk files in the first place
given that there's no privilege boundary between the OS user and the
database, but a lot of people seem to think it is and maybe they're
right.  However, who can justify encrypting only SOME of the disk
files and not others?  I mean, maybe you could justify not encrypting
the SLRU files on the grounds that they probably don't leak much in
the way of interesting information, but the WAL files certainly do --
your data is there, just as much as in the data files themselves.

To be honest, I think there is a lot to like about the patches
Cybertec has proposed.  Those patches don't have all of the fancy
key-management stuff that you are proposing here, but maybe that
stuff, if we want it, could be added, rather than starting over from
scratch.  It seems to me that those patches get a lot of things right.
In particular, it looked to me when I looked at them like they made a
pretty determined effort to encrypt every byte that might go down to
the disk.  It seems to me that you if you want encryption, you want
that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Refactoring the checkpointer's fsync request queue

2019-03-01 Thread Robert Haas
On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath  wrote:
> I disagree, at least with combining and retaining enums. Encoding all
> the possible request types with the current, planned and future SMGRs
> would cause a sheer explosion in the number of enum  values.

How big of an explosion would it be?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Infinity vs Error for division by zero

2019-03-01 Thread Chapman Flack
On 3/1/19 2:26 PM, David G. Johnston wrote:

> Upon further reading you are correct - IEEE 754 has chosen to treat n/0
> differently for n=0 and n<>0 cases.  I'm sure they have their reasons but
> ... I have neither the use for, nor the time for the distraction of,
> understanding why such a decision was made and how it could be useful.

The answer may be as simple as the inherent difference between
the cases.

0/0 is funny because of a uniqueness problem. Try to name q such that
0/0 = q, rewritten as q × 0 = 0, and the problem you run into is that
that's true for any value of q. So you would have to make some
completely arbitrary decision to name any value at all as "the" result.

(anything nonzero)/0 is funny because of a representability problem.
n/0 = q, rewritten as q × 0 = n, only has the problem that it's
untrue for every finite value q; they're never big enough. Calling the
result infinity is again a definitional decision, but this time it
is not an arbitrary one among multiple equally good choices; all
finite choices are ruled out, and the definitional choice is fully
consistent with what you see happening as a divisor *approaches* zero.
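
For concreteness, a tiny standalone C program using IEEE doubles shows exactly
that behaviour (illustrative only; nothing below is PostgreSQL code):

#include <stdio.h>
#include <math.h>

int
main(void)
{
    /* use variables so the compiler doesn't warn about literal division */
    double  zero = 0.0;
    double  one = 1.0;

    printf("%g\n", one / zero);     /* inf: no finite q satisfies q * 0 = 1 */
    printf("%g\n", -one / zero);    /* -inf */
    printf("%g\n", zero / zero);    /* nan (or -nan): q * 0 = 0 for every q */

    printf("%d %d\n", isinf(one / zero) != 0, isnan(zero / zero) != 0); /* 1 1 */
    return 0;
}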

> For my part in the queries I have that encounter divide-by-zero I end up
> transforming the result to zero which is considerably easier to
> present

Easy to present it may be, but it lacks the mathematical
motivation behind the choice IEEE made ... as a value for q, zero
fails the q × 0 = n test fairly convincingly for nonzero n. :)

Regards,
-Chap



Re: Refactoring the checkpointer's fsync request queue

2019-03-01 Thread Thomas Munro
On Sat, Mar 2, 2019 at 8:36 AM Shawn Debnath  wrote:
> On Fri, Mar 01, 2019 at 01:15:21PM -0500, Robert Haas wrote:
> > > >
> > > > +typedef enum syncrequestowner
> > > > +{
> > > > +   SYNC_MD = 0 /* md smgr */
> > > > +} syncrequestowner;
> > > >
> > > > I have a feeling that Andres wanted to see a single enum combining
> > > > both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD,
> > > > SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you
> > > > have it.
> > >
> > > Obviously it's nicer looking this way, but OTOH, that means we have to
> > > send more data over the queue, because we can't easily combine the
> > > request + "owner". I don't have too strong feelings about it though.
> >
> > Yeah, I would lean toward combining those.
>
> I disagree, at least with combining and retaining enums. Encoding all
> the possible request types with the current, planned and future SMGRs
> would cause a sheer explosion in the number of enum  values. Not to
> mention that you have multiple enum values for the same behavior - which
> just isn't clean. And handling of these enums in the code would be ugly
> too.
>
> Do note that these are typedef'ed to uint8 currently. For a default
> config with 128 MB shared_buffers, we will use an extra 16kB (one extra
> byte to represent the owner).  I am hesitant to change this right now
> unless folks feel strongly about it.
>
> If so, I would combine the type and owner by splitting it up in 4 bit
> chunks, allowing for 16 request types and 16 smgrs. This change would
> only apply for the in-memory queue. The code and functions would
> continue using the enums.

+1
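
For anyone skimming: 128 MB of shared_buffers is 16384 8 kB buffers, which is
where the ~16 kB figure above comes from (one extra byte per queue slot).  A
minimal standalone sketch of the 4-bit packing idea, with names invented here
rather than taken from the patch:

#include <stdint.h>
#include <assert.h>

/* One byte per queue entry: high nibble = request type, low nibble = smgr id. */
typedef uint8_t packed_sync_request;

static inline packed_sync_request
pack_sync_request(unsigned type, unsigned owner)
{
    assert(type < 16 && owner < 16);    /* room for 16 request types, 16 smgrs */
    return (packed_sync_request) ((type << 4) | owner);
}

static inline unsigned
sync_request_type(packed_sync_request r)
{
    return r >> 4;
}

static inline unsigned
sync_request_owner(packed_sync_request r)
{
    return r & 0x0F;
}

The enums would stay as the API; only the queued byte gets packed and unpacked
at the boundary.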

-- 
Thomas Munro
https://enterprisedb.com



Re: Why don't we have a small reserved OID range for patch revisions?

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 5:36 PM Peter Geoghegan  wrote:
> I have attempted to institute some general guidelines for what the
> thicket of rules are by creating the "committing checklist" page. This
> is necessarily imperfect, because the rules are in many cases open to
> interpretation, often for good practical reasons. I don't have any
> sympathy for committers that find it hard to remember to do a
> catversion bump with any kind of regularity. That complexity seems
> inherent, not incidental, since it's often convenient to ignore
> catalog incompatibilities during development.

Bumping catversion is something of a special case, because one does it
so often that one gets used to remembering it.  The rules are usually
not too hard to remember, although they are trickier when you don't
directly change anything in src/include/catalog but just change the
definition of some node that may be serialized in a catalog someplace.
It would be neat if there were a tool you could run to somehow tell
you whether catversion needs to be changed for a given patch.

But you know there are a lot of other version numbers floating around,
like XLOG_PAGE_MAGIC or the pg_dump archive version, and it is not
really that easy to know -- as a new contributor or sometimes even as
an experienced one -- whether your work requires any changes to that
stuff, or even that that stuff *exists*.  Indeed, XLOG_PAGE_MAGIC is a
particularly annoying case, both because the constant name doesn't
contain VERSION and because the comment just says /* can be used as
WAL version indicator */ which does not exactly make it clear that if
you fail to bump it when you touch the WAL format you will Make People
Unhappy.

Indeed, Simon got complaints a number of years ago (2010, it looks
like) when he had the temerity to change the magic number to some
other unrelated value instead of just incrementing it by one.
Although I think that the criticism was to a certain extent
well-founded -- why deviate from previous practice? -- there is at the
same time something a little crazy about somebody getting excited
about the particular value that has been chosen for a number that is
described in the very name of the constant as a MAGIC number.  And
especially because there is absolutely zip in the way of code comments
or a README that explain to you how to do it "right."

> > So, I suspect that for every
> > unit of work it saves somebody, it's probably going to generate about
> > one unit of extra work for somebody else.
>
> Maybe so. I think that you're jumping to conclusions, though.

I did say "I suspect..." which was intended as a concession that I
don't know for sure.

> But you seem to want to make the mechanism itself even more
> complicated, not less complicated (based on your remarks about making
> OID assignment happen during the build). In order to make the use of
> the mechanism easier. That seems worth considering, but ISTM that this
> is talking at cross purposes. There are far simpler ways of making it
> unlikely that a committer is going to miss this step. There is also a
> simple way of noticing that they do quickly (e.g. a simple buildfarm
> test).

Well, perhaps I'm proposing some additional code, but I don't think of
that as making the mechanism more complicated.  I want to make it
simpler for patch submitters and reviewers and committers to not make
mistakes that they have to run around and fix.  If there are fewer
kinds of things that qualify as mistakes, as in Tom's latest proposal,
then we are moving in the right direction IMO regardless of anything
else.

> > OK.  Well, I think that doing nothing is superior to this proposal,
> > for reasons similar to what Peter Eisentraut has already articulated.
> > And I think rather than blasting forward with your own preferred
> > alternative in the face of disagreement, you should be willing to
> > discuss other possible options.  But if you're not willing to do that,
> > I can't make you.
>
> Peter seemed to not want to do this on the grounds that it isn't
> necessary at all, whereas you think that it doesn't go far enough. If
> there is a consensus against what Tom has said, it's a cacophonous one
> that cannot really be said to be in favor of anything.

I think Peter and I are more agreeing than we are at the opposite ends
of a spectrum, but more importantly, I think it is worth having a
discussion first about what people like and dislike, and what goals
they have, and then only if necessary, counting the votes afterwards.
I don't like having the feeling that because I have a different view
of something and want to write an email about that, I am somehow an
impediment to progress.  I think if we reduce discussions to
you're-for-it-or-your-against-it, that's not that helpful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Why don't we have a small reserved OID range for patch revisions?

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 6:40 PM Tom Lane  wrote:
> 1. Encourage people to develop new patches using chosen-at-random
> high OIDs, in the 7K-9K range.  They do this already, it'd just
> be encouraged instead of discouraged.
>
> 2. Commit patches as received.
>
> 3. Once each devel cycle, after feature freeze, somebody uses the
> renumbering tool to shove all the new OIDs down to lower numbers,
> freeing the high-OID range for the next devel cycle.  We'd have
> to remember to do that, but it could be added to the RELEASE_CHANGES
> checklist.

Sure, that sounds nice.  It seems like it might be slightly less
convenient for non-committers than what I was proposing, but still
more convenient than what they're doing right now.  And it's also more
convenient for committers, because they're not being asked to manually
fiddle patches at the last moment, something that I at least find
rather error-prone.  It also, and I think this is really good, moves
in the direction of fewer things for both patch authors and patch
committers to worry about doing wrong.  Instead of throwing rocks at
people whose OID assignments are "wrong," we just accept what people
do and adjust it later if it makes sense to do so.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Tom Lane
Andrew Dunstan  writes:
> On 3/1/19 2:14 PM, Tom Lane wrote:
>> Indeed, but I'm not sure that the use-cases are the same.  In particular,
>> unless somebody has done some rather impossible magic, it would be
>> disastrous to apply DISABLE_INDEX_CLEANUP as a reloption, because then
>> it would be persistent and you'd never get a real vacuum operation and
>> soon your disk would be full.  Permanently applying truncation disabling
>> seems less insane.

> You could allow an explicitly set command option to override the reloption.
> It's important for us to be able to control the vacuum phases more. In
> particular, the index cleanup phase can have significant system impact
> but often doesn't need to be done immediately.

I'm not objecting to having a manual command option to skip index cleanup
(which basically reduces to "do nothing but tuple freezing", right?
maybe it should be named/documented that way).  Applying it as a reloption
seems like a foot-gun, though.

regards, tom lane



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Tom Lane
Andres Freund  writes:
> On 2019-03-01 14:17:33 -0500, Tom Lane wrote:
>> I think we should reject the whole patch, tbh, and go do something
>> about the underlying problem instead.  Once we've made truncation
>> not require AEL, this will be nothing but a legacy wart that we'll
>> have a hard time getting rid of.

> IDK, it's really painful in the field, and I'm not quite seeing us
> getting rid of the AEL for v12.

Dunno, I was musing about it just yesterday, in
https://postgr.es/m/1261.1551392...@sss.pgh.pa.us

I'd sure rather spend time making that happen than this.  I'm also
not entirely convinced that we MUST do something about this in v12
rather than v13 --- we've been living with it ever since we had
in-core replication, why's it suddenly so critical?

> I think it's a wart, but one that works
> around a pretty important usability issue. And I think we should just
> remove the GUC without any sort of deprecation afterwards, if necessary
> we can add a note to the docs to that effect.  It's not like preventing
> truncation from happening is a very intrusive / dangerous thing to do.

Well, if we add a reloption then we can never ever get rid of it; at
best we could ignore it.  So from the perspective of how-fast-can-we-
deprecate-this, maybe a GUC is the better answer.  On the other hand,
I'm not sure I believe that many installations could afford to disable
truncation for every single table.

regards, tom lane



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Andrew Dunstan


On 3/1/19 2:14 PM, Tom Lane wrote:
> Robert Haas  writes:
>> I want to make one other point about this patch, which is that over on
>> the thread "New vacuum option to do only freezing" we have a patch
>> that does a closely-related thing.  Both patches skip one phase of the
>> overall VACUUM process.  THIS patch wants to skip truncation; THAT
>> patch wants to skip index cleanup.  Over there, we seem to have
>> settled on DISABLE_INDEX_CLEANUP -- only available as a VACUUM option
>> -- and here I think the proposal is currently VACUUM_SHRINK_ENABLED --
>> only available as a reloption.
>> Now that seems not very consistent.
> Indeed, but I'm not sure that the use-cases are the same.  In particular,
> unless somebody has done some rather impossible magic, it would be
> disastrous to apply DISABLE_INDEX_CLEANUP as a reloption, because then
> it would be persistent and you'd never get a real vacuum operation and
> soon your disk would be full.  Permanently applying truncation disabling
> seems less insane.
>
> The gratuitously inconsistent spellings should be harmonized, for sure.
>
>   


You could allow an explicitly set command option to override the reloption.


It's important for us to be able to control the vacuum phases more. In
particular, the index cleanup phase can have significant system impact
but often doesn't need to be done immediately.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Infinity vs Error for division by zero

2019-03-01 Thread David G. Johnston
On Friday, March 1, 2019, Chapman Flack  wrote:

>
> But if someone wanted to write a user-defined division function or
> operator that would return Inf for (anything > 0) / 0 and for
> (anything < 0) / -0, and -Inf for (anything < 0) / 0 and for
> (anything > 0) / -0, and NaN for (either zero) / (either zero), I think
> that function or operator would be fully in keeping with IEEE 754.
>

Upon further reading you are correct - IEEE 754 has chosen to treat n/0
differently for n=0 and n<>0 cases.  I'm sure they have their reasons but
within the scope of this database, and the core arithmetic functions it
provides, those distinctions don't seem meaningful, and having to add
query logic to deal with both cases would just be annoying.  I have neither
the use for, nor the time for the distraction of, understanding why such a
decision was made and how it could be useful.  Going from an exception to NaN makes
sense to me, going instead to infinity - outside of limit expressions which
aren't applicable here - does not.

For my part in the queries I have that encounter divide-by-zero I end up
transforming the result to zero which is considerably easier to
present/absorb alongside other valid fractions in a table or chart.

David J.


Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Andres Freund
Hi,

On 2019-03-01 14:17:33 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > OTOH, as the main reason for wanting to disable truncation is that a
> > user is getting very undesirable HS conflicts, it doesn't seem right to
> > force them to change the reloption on all tables, and then somehow force
> > it to be set on all tables created at a later stage. I'm not sure how
> > that'd be better?
> 
> I think we should reject the whole patch, tbh, and go do something
> about the underlying problem instead.  Once we've made truncation
> not require AEL, this will be nothing but a legacy wart that we'll
> have a hard time getting rid of.

IDK, it's really painful in the field, and I'm not quite seeing us
getting rid of the AEL for v12. I think it's a wart, but one that works
around a pretty important usability issue. And I think we should just
remove the GUC without any sort of deprecation afterwards, if necessary
we can add a note to the docs to that effect.  It's not like preventing
truncation from happening is a very intrusive / dangerous thing to do.

Greetings,

Andres Freund



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Tom Lane
Andres Freund  writes:
> OTOH, as the main reason for wanting to disable truncation is that a
> user is getting very undesirable HS conflicts, it doesn't seem right to
> force them to change the reloption on all tables, and then somehow force
> it to be set on all tables created at a later stage. I'm not sure how
> that'd be better?

I think we should reject the whole patch, tbh, and go do something
about the underlying problem instead.  Once we've made truncation
not require AEL, this will be nothing but a legacy wart that we'll
have a hard time getting rid of.

regards, tom lane



Re: [HACKERS] CLUSTER command progress monitor

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 11:54 PM Tatsuro Yamada
 wrote:
> Attached patch is wip patch.

+   CLUSTER and VACUUM FULL,
showing current progress.

and -> or

+   certain commands during command execution.  Currently, the suppoted
+   progress reporting commands are VACUUM and
CLUSTER.

suppoted -> supported

But I'd just say: Currently, the only commands which support progress
reporting are VACUUM and
CLUSTER.

+   Running VACUUM FULL is listed in
pg_stat_progress_cluster
+   view because it uses CLUSTER command
internally.  See .

How about: Running VACUUM FULL is listed in
pg_stat_progress_cluster because both
VACUUM FULL and CLUSTER rewrite
the table, while regular VACUUM only modifies it in
place.

+   Current processing command: CLUSTER/VACUUM FULL.

The command that is running.  Either CLUSTER or VACUUM FULL.

+   Current processing phase of cluster/vacuum full.  See  or .

Current processing phase of CLUSTER or VACUUM FULL.

Or maybe better, just abbreviate to: Current processing phase.

+   Scan method of table: index scan/seq scan.

Eh, shouldn't this be gone now?  And likewise for the view definition?

+   OID of the index.

If the table is being scanned using an index, this is the OID of the
index being used; otherwise, it is zero.

+ heap_tuples_total

Leftovers.  Skipping over the rest of your documentation changes since
it looks like a bunch of things there still need to be updated.

+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);

This now appears inside cluster_rel(), but also vacuum_rel() is still
doing the same thing.  That's wrong.

+ if(OidIsValid(indexOid))

Missing space.  Please pgindent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Tom Lane
Robert Haas  writes:
> I want to make one other point about this patch, which is that over on
> the thread "New vacuum option to do only freezing" we have a patch
> that does a closely-related thing.  Both patches skip one phase of the
> overall VACUUM process.  THIS patch wants to skip truncation; THAT
> patch wants to skip index cleanup.  Over there, we seem to have
> settled on DISABLE_INDEX_CLEANUP -- only available as a VACUUM option
> -- and here I think the proposal is currently VACUUM_SHRINK_ENABLED --
> only available as a reloption.

> Now that seems not very consistent.

Indeed, but I'm not sure that the use-cases are the same.  In particular,
unless somebody has done some rather impossible magic, it would be
disastrous to apply DISABLE_INDEX_CLEANUP as a reloption, because then
it would be persistent and you'd never get a real vacuum operation and
soon your disk would be full.  Permanently applying truncation disabling
seems less insane.

The gratuitously inconsistent spellings should be harmonized, for sure.

regards, tom lane



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Andres Freund
Hi,

On 2019-02-27 10:55:49 -0500, Robert Haas wrote:
> I don't think that a VACUUM option would be out of place, but a GUC
> sounds like an attractive nuisance to me.  It will encourage people to
> just flip it blindly instead of considering the particular cases where
> they need that behavior, and I think chances are good that most people
> who do that will end up being sad.

OTOH, as the main reason for wanting to disable truncation is that a
user is getting very undesirable HS conflicts, it doesn't seem right to
force them to change the reloption on all tables, and then somehow force
it to be set on all tables created at a later stage. I'm not sure how
that'd be better?

Greetings,

Andres Freund



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Andrew Dunstan


On 3/1/19 1:43 PM, Robert Haas wrote:
> On Thu, Feb 28, 2019 at 3:17 AM Tsunakawa, Takayuki
>  wrote:
>> Uh, thanks.  I've just recognized I didn't know the meaning of "nuisance."  
>> I've looked up the meaning in the dictionary.  Nuisance is like a trouble 
>> maker...
> My proposal would be that we make both options available as both
> reloptions and vacuum options.  


+many


cheers


andrew



-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] EvalPlanQual behaves oddly for FDW queries involving system columns

2019-03-01 Thread Andres Freund
Hi,

On 2019-02-28 10:18:36 -0800, Andres Freund wrote:
> On 2019-02-28 18:28:37 +0900, Etsuro Fujita wrote:
> > > I'm currently
> > > converting the EPQ machinery to slots, and in course of that I (with
> > > Horiguchi-san's help), converted RefetchForeignRow to return a slot. But
> > > there's currently no in-core user of this facility...  I guess I can
> > > rebase the preliminary postgres_fdw patch here, but it bitrotted
> > > significantly.
> > 
> > I'll rebase that patch and help the testing, if you want me to.
> 
> That'd be awesome.

FWIW, I pushed the EPQ patch, doing this conversion blindly. It'd be
awesome if you'd check that it actually works...

Thanks,

Andres



Re: [HACKERS] Block level parallel vacuum

2019-03-01 Thread Robert Haas
On Fri, Mar 1, 2019 at 12:19 AM Masahiko Sawada  wrote:
> > I wonder if we really want this behavior.  Should a setting that
> > controls the degree of parallelism when scanning the table also affect
> > VACUUM?  I tend to think that we probably don't ever want VACUUM of a
> > table to be parallel by default, but rather something that the user
> > must explicitly request.  Happy to hear other opinions.  If we do want
> > this behavior, I think this should be written differently, something
> > like this: The PARALLEL N option to VACUUM takes precedence over this
> > option.
>
> For example, I can imagine a use case where a batch job does parallel
> vacuum on some tables in a maintenance window. The batch operation
> will need to compute and specify the degree of parallelism every time
> according to, for instance, the number of indexes, which would be
> troublesome. But if we can set the degree of parallelism for each
> table, it can just do 'VACUUM (PARALLEL)'.

True, but the setting in question would also affect the behavior of
sequential scans and index scans.  TBH, I'm not sure that the
parallel_workers reloption is really a great design as it is: is
hard-coding the number of workers really what people want?  Do they
really want the same degree of parallelism for sequential scans and
index scans?  Why should they want the same degree of parallelism also
for VACUUM?  Maybe they do, and maybe somebody explain why they do,
but as of now, it's not obvious to me why that should be true.

> Since the parallel vacuum uses memory in the same manner as the single-
> process vacuum, it's not made any worse. I'd agree that that patch is
> smarter and this patch can be built on top of it, but I'm
> concerned that there are two proposals on that thread and the discussion
> has not been active for 8 months. I wonder if it would be worth
> improving the memory allocation based on that patch after the
> parallel vacuum gets committed.

Well, I think we can't just say "oh, this patch is going to use twice
as much memory as before," which is what it looks like it's doing
right now. If you think it's not doing that, can you explain further?

> Agreed. I'll separate patches and propose it.

Cool.  Probably best to keep that on this thread.

> The main motivations are refactoring and improving readability but
> it's mainly for the previous version patch which implements parallel
> heap vacuum. It might no longer need here. I'll try to implement
> without LVState. Thank you.

Oh, OK.

> > + /*
> > + * If there is already-updated result in the shared memory we use it.
> > + * Otherwise we pass NULL to index AMs, meaning it's first time call,
> > + * and copy the result to the shared memory segment.
> > + */
> >
> > I'm probably missing something here, but isn't the intention that we
> > only do each index once?  If so, how would there be anything there
> > already?  Once from for_cleanup = false and once for for_cleanup =
> > true?
>
> We call ambulkdelete (for_cleanup = false) 0 or more times for each
> index and call amvacuumcleanup (for_cleanup = true) at the end. The
> first time it calls either ambulkdelete or amvacuumcleanup, the lazy
> vacuum must pass NULL to them. They return either a palloc'd
> IndexBulkDeleteResult or NULL. If they return the former, the lazy
> vacuum must pass it to them again the next time. In the current design,
> since there is no guarantee that an index is always processed by the
> same vacuum process, each vacuum process saves the result to DSM in
> order to share those results among vacuum processes. The 'updated'
> flag indicates that its slot is used. So we can pass the address of
> the DSM if 'updated' is true, otherwise pass NULL.

Ah, OK.  Thanks for explaining.

> Almost comments I got have been incorporated to the local branch but a
> few comments need discussion. I'll submit the updated version patch
> once I addressed all of comments.

Cool.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: New vacuum option to do only freezing

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 3:12 AM Masahiko Sawada  wrote:
> Attached the updated version patch.

Regarding the user interface for this patch, please have a look at the
concerns I mention in

https://www.postgresql.org/message-id/ca+tgmozorx_uuv67rjasx_aswkdzwv8kwfkfrwxyldcqfqj...@mail.gmail.com

I provided a suggestion there as to how to resolve the issue but
naturally others may feel differently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 3:17 AM Tsunakawa, Takayuki
 wrote:
> Uh, thanks.  I've just recognized I didn't know the meaning of "nuisance."  
> I've looked up the meaning in the dictionary.  Nuisance is like a trouble 
> maker...

Yes, and "attractive nuisance" means something that, superficially, it
looks like a good idea, but later, you find out that it creates many
problems.

I want to make one other point about this patch, which is that over on
the thread "New vacuum option to do only freezing" we have a patch
that does a closely-related thing.  Both patches skip one phase of the
overall VACUUM process.  THIS patch wants to skip truncation; THAT
patch wants to skip index cleanup.  Over there, we seem to have
settled on DISABLE_INDEX_CLEANUP -- only available as a VACUUM option
-- and here I think the proposal is currently VACUUM_SHRINK_ENABLED --
only available as a reloption.

Now that seems not very consistent.  One can only be set as a
reloption, the other only as a VACUUM option.  One talks about what is
enabled, the other about what is disabled.  One puts enable/disable at
the start of the name, the other at the end.

My proposal would be that we make both options available as both
reloptions and vacuum options.  Call the VACUUM options INDEX_CLEANUP
and TRUNCATE and the default will be true but the user can specify
false.  For the reloptions, prefix "vacuum_", thus
vacuum_index_cleanup = true/false and vacuum_truncate = true/false.
If that doesn't appeal, I am open to other ideas how to make this
consistent, but I think it should in some way be made consistent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Infinity vs Error for division by zero

2019-03-01 Thread Andres Freund
On 2019-03-01 11:04:04 -0700, David G. Johnston wrote:
> Changing the behavior is not going to happen for any existing data types.

For the overflow case that really sucks, because we're leaving a very
significant amount of performance on the table because we recheck for
overflow in every op. The actual float operation is basically free, but
the overflow check and the calling convention are not. JIT can get rid of the
latter, but not the former. Which is why we spend like 30% in one of the
TPCH queries doing overflow checks...
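
As a rough standalone illustration of the pattern being described (not the
actual backend code or macros):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* The addition itself is essentially free; the post-operation check and its
 * branch are what show up on hot paths. */
static double
checked_add(double a, double b)
{
    double  result = a + b;

    if (isinf(result) && !isinf(a) && !isinf(b))
    {
        fprintf(stderr, "value out of range: overflow\n");
        exit(1);
    }
    return result;
}

int
main(void)
{
    printf("%g\n", checked_add(1.5, 2.5));          /* 4 */
    printf("%g\n", checked_add(1.7e308, 1.7e308));  /* reports overflow, exits */
    return 0;
}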

I still kinda wonder whether we can make trapping operations work, but
it's not trivial.



Re: Infinity vs Error for division by zero

2019-03-01 Thread Chapman Flack
On 3/1/19 1:04 PM, David G. Johnston wrote:

> 1/0 is an illegal operation.  We could return NaN for it but the choice of
> throwing an error is just as correct.  Returning infinity is strictly
> incorrect.

That differs from my understanding of how the operations are specified
in IEEE 754 (as summarized in, e.g., [1]).

Andrew posted the relevant part of the SQL spec that requires the
operation to raise 22012.

That's a requirement specific to SQL (which is, of course, what matters
here.)

But if someone wanted to write a user-defined division function or
operator that would return Inf for (anything > 0) / 0 and for
(anything < 0) / -0, and -Inf for (anything < 0) / 0 and for
(anything > 0) / -0, and NaN for (either zero) / (either zero), I think
that function or operator would be fully in keeping with IEEE 754.
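
For what it's worth, such a function is nearly a one-liner in C, since it only
has to skip the zero check and let the hardware semantics through.  A sketch of
a hypothetical extension function (not core code, and the name is made up):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(float8_div_ieee);

/* x/0 -> +/-Infinity, 0/0 -> NaN, per IEEE 754 on IEEE-float platforms */
Datum
float8_div_ieee(PG_FUNCTION_ARGS)
{
    float8      arg1 = PG_GETARG_FLOAT8(0);
    float8      arg2 = PG_GETARG_FLOAT8(1);

    PG_RETURN_FLOAT8(arg1 / arg2);
}

One would then wrap that in CREATE FUNCTION / CREATE OPERATOR in the extension
script; the spec requirement mentioned above applies to the built-in operator,
not to one you define yourself.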

-Chap


[1] https://steve.hollasch.net/cgindex/coding/ieeefloat.html#operations



Re: Log a sample of transactions

2019-03-01 Thread Adrien NAYRAT

Hello,

On 2/15/19 3:24 PM, Adrien NAYRAT wrote:

> On 2/14/19 9:14 PM, Andres Freund wrote:
>> I wonder if this doesn't need a warning, explaining that using this when
>> there are large transactions could lead to slowdowns.
>
> Yes, I will add some wording

Warning added.

>> It seems pretty weird to have this in postgres.c, given you declared it
>> in xact.h?
>
> Yes, I have to revise my C. I will move it to
> src/backend/access/transam/xact.c

Fixed

>> Why are both of these checked? ISTM once xact_is_sampled is set, we
>> ought not to respect a different value of log_xact_sample_rate for the
>> rest of that transaction.
>
> I added these checks to allow disabling logging during a sampled
> transaction, per suggestion of Masahiko Sawada:
> https://www.postgresql.org/message-id/CAD21AoD4t%2BhTV6XfK5Yz%3DEocB8oMyJSYFfjAryCDYtqfib2GrA%40mail.gmail.com

I added a comment to explain why transaction logging is rechecked.

>> As far as I can tell xact_is_sampled is not properly reset in case of
>> errors?





I am not sure if I should disable logging in case of errors. Actually we 
have:


postgres=# set log_transaction_sample_rate to 1;
SET
postgres=# set client_min_messages to 'LOG';
LOG:  duration: 0.392 ms  statement: set client_min_messages to 'LOG';
SET
postgres=# begin;
LOG:  duration: 0.345 ms  statement: begin;
BEGIN
postgres=# select 1;
LOG:  duration: 0.479 ms  statement: select 1;
 ?column? 
----------
        1
(1 row)

postgres=# select * from missingtable;
ERROR:  relation "missingtable" does not exist
LINE 1: select * from missingtable;
  ^
postgres=# select 1;
ERROR:  current transaction is aborted, commands ignored until end of 
transaction block

postgres=# rollback;
LOG:  duration: 11390.295 ms  statement: rollback;
ROLLBACK

If I reset xact_is_sampled (after the first error inside 
AbortTransaction if I am right), "rollback" statement will not be 
logged. I wonder if this "statement" should be logged?


If the answer is no, I will reset xact_is_sampled in AbortTransaction.



Patch attached with fix mentioned above, but without xact_is_sampled reset.

Cheers,


diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 6d42b7afe7..dbf5b81243 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5819,6 +5819,32 @@ local0.*/var/log/postgresql

   
 
+ 
+  log_transaction_sample_rate (real)
+  
+   log_transaction_sample_rate configuration parameter
+  
+  
+   
+
+ Set the fraction of transactions whose statements are logged. It applies
+ to each new transaction regardless of its duration. The default is
+ 0, meaning no transaction statements are logged.
+ Setting this to 1 logs all statements for all transactions.
+ Logging can be disabled inside a transaction by setting this to 0.
+ log_transaction_sample_rate is helpful to track a
+ sample of transactions.
+
+   
+
+ This option can add overhead when transactions are large (many short queries),
+ or when the workload is high and log_transaction_sample_rate
+ is close to 1.
+
+   
+   
+ 
+
  
 
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e93262975d..2a301b88ea 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -244,6 +244,9 @@ static char *prepareGID;
  */
 static bool forceSyncCommit = false;
 
+/* Flag for logging statements in a transaction. */
+bool        xact_is_sampled = false;
+
 /*
  * Private context for transaction-abort work --- we reserve space for this
  * at startup to ensure that AbortTransaction and AbortSubTransaction can work
@@ -1822,6 +1825,9 @@ StartTransaction(void)
 	s->state = TRANS_START;
 	s->transactionId = InvalidTransactionId;	/* until assigned */
 
+	/* Determine if statements are logged in this transaction */
+	xact_is_sampled = random() <= log_xact_sample_rate * MAX_RANDOM_VALUE;
+
 	/*
 	 * initialize current transaction state fields
 	 *
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8b4d94c9a1..76f7f8a912 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2203,6 +2203,8 @@ check_log_statement(List *stmt_list)
  * check_log_duration
  *		Determine whether current command's duration should be logged.
  *		If log_statement_sample_rate < 1.0, log only a sample.
+ *		We also check if this statement in this transaction must be logged
+ *		(regardless of its duration).
  *
  * Returns:
  *		0 if no logging is needed
@@ -2218,7 +2220,8 @@ check_log_statement(List *stmt_list)
 int
 check_log_duration(char *msec_str, bool was_logged)
 {
-	if (log_duration || log_min_duration_statement >= 0)
+	if (log_duration || log_min_duration_statement >= 0 ||
+		(xact_is_sampled && log_xact_sample_rate > 0))
 	{
 		long		secs;
 		int			usecs;
@@ -2252,11 +2255,17 @@ check_l

Re: Refactoring the checkpointer's fsync request queue

2019-03-01 Thread Robert Haas
On Fri, Mar 1, 2019 at 12:43 PM Andres Freund  wrote:
> Obviously it's nicer looking this way, but OTOH, that means we have to
> send more data over the queue, because we can't easily combine the
> request + "owner". I don't have too strong feelings about it though.

Yeah, I would lean toward combining those.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Infinity vs Error for division by zero

2019-03-01 Thread Andres Freund
Hi,

On 2019-03-01 12:46:55 -0500, Matt Pulver wrote:
> PostgreSQL FLOAT appears to support +/-Infinity and NaN per the IEEE 754
> standard, with expressions such as CAST('NaN' AS FLOAT) and CAST('Infinity'
> AS FLOAT) and even supports ordering columns of floats that contain NaN.
> 
> However the query "SELECT 1.0/0.0;" produces an exception:
> 
> ERROR:  division by zero
> 
> 
> Question: If Infinity and NaN are supported, then why throw an exception
> here, instead of returning Infinity? Is it purely for historical reasons,
> or if it could all be done again, would an exception still be preferred?
> 
> For purely integer arithmetic, I can see how an exception would make sense.
> However for FLOAT, I would expect/prefer Infinity to be returned.

It'd be good for performance reasons to not have to check for that, and
for under/overflow. But the historical behaviour has quite some weight,
and there's some language in the standard that can legitimately be
interpreted that both conditions need to be signalled, if I recall
correctly.

Greetings,

Andres Freund



Re: Infinity vs Error for division by zero

2019-03-01 Thread David G. Johnston
On Friday, March 1, 2019, Matt Pulver  wrote:

> However the query "SELECT 1.0/0.0;" produces an exception:
>
> ERROR:  division by zero
>
>
> Question: If Infinity and NaN are supported, then why throw an exception
> here, instead of returning Infinity? Is it purely for historical reasons,
> or if it could all be done again, would an exception still be preferred?
>
> For purely integer arithmetic, I can see how an exception would make
> sense. However for FLOAT, I would expect/prefer Infinity to be returned.
>

1/0 is an illegal operation.  We could return NaN for it but the choice of
throwing an error is just as correct.  Returning infinity is strictly
incorrect.

Changing the behavior is not going to happen for any existing data types.

David J.


Re: Infinity vs Error for division by zero

2019-03-01 Thread Andrew Gierth
> "Matt" == Matt Pulver  writes:

 Matt> ERROR:  division by zero

 Matt> Question: If Infinity and NaN are supported, then why throw an
 Matt> exception here, instead of returning Infinity?

Spec says so:

  4) The dyadic arithmetic operators , ,
 , and  (+, -, *, and /, respectively) specify
 addition, subtraction, multiplication, and division, respectively.
 If the value of a divisor is zero, then an exception condition is
 raised: data exception -- division by zero.

-- 
Andrew (irc:RhodiumToad)



Infinity vs Error for division by zero

2019-03-01 Thread Matt Pulver
Hello,

PostgreSQL FLOAT appears to support +/-Infinity and NaN per the IEEE 754
standard, with expressions such as CAST('NaN' AS FLOAT) and CAST('Infinity'
AS FLOAT) and even supports ordering columns of floats that contain NaN.

However the query "SELECT 1.0/0.0;" produces an exception:

ERROR:  division by zero


Question: If Infinity and NaN are supported, then why throw an exception
here, instead of returning Infinity? Is it purely for historical reasons,
or if it could all be done again, would an exception still be preferred?

For purely integer arithmetic, I can see how an exception would make sense.
However for FLOAT, I would expect/prefer Infinity to be returned.

Best regards,
Matt


Re: Minimal logical decoding on standbys

2019-03-01 Thread Andres Freund
Hi,

On 2019-03-01 13:33:23 +0530, tushar wrote:
> While testing this feature I found that if lots of inserts happened on the
> master cluster then pg_recvlogical is not showing the DATA information on
> the logical replication slot which was created on the SLAVE.
> 
> Please refer this scenario -
> 
> 1)
> Create a Master cluster with wal_level=logical and create a logical replication
> slot -
>  SELECT * FROM pg_create_logical_replication_slot('master_slot',
> 'test_decoding');
> 
> 2)
> Create a Standby  cluster using pg_basebackup ( ./pg_basebackup -D slave/ -v
> -R)  and create logical replication slot -
> SELECT * FROM pg_create_logical_replication_slot('standby_slot',
> 'test_decoding');

So, if I understand correctly you do *not* have a physical replication
slot for this standby? For the feature to work reliably that needs to
exist, and you need to have hot_standby_feedback enabled. Does having
that fix the issue?

Thanks,

Andres



Re: Refactoring the checkpointer's fsync request queue

2019-03-01 Thread Andres Freund
Hi,

On 2019-03-01 23:17:27 +1300, Thomas Munro wrote:
> @@ -8616,7 +8617,7 @@ CreateCheckPoint(int flags)
>  * the REDO pointer.  Note that smgr must not do anything that'd have 
> to
>  * be undone if we decide no checkpoint is needed.
>  */
> -   smgrpreckpt();
> +   PreCheckpoint();
> 
> I would call this and the "post" variant something like
> SyncPreCheckpoint().  Otherwise it's too general sounding and not
> clear which module it's in.

Definitely.

> +typedef enum syncrequesttype
> +{
> +   SYNC_REQUEST,
> +   FORGET_REQUEST,
> +   FORGET_HIERARCHY_REQUEST,
> +   UNLINK_REQUEST
> +} syncrequesttype;
> 
> These names are too generic for the global C namespace; how about
> SYNC_REQ_CANCEL or similar?
> 
> +typedef enum syncrequestowner
> +{
> +   SYNC_MD = 0 /* md smgr */
> +} syncrequestowner;
> 
> I have a feeling that Andres wanted to see a single enum combining
> both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD,
> SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you
> have it.

Obviously it's nicer looking this way, but OTOH, that means we have to
send more data over the queue, because we can't easily combine the
request + "owner". I don't have too strong feelings about it though.

FWIW, I don't like the name owner here. Class? Method?

Greetings,

Andres Freund



Re: readdir is incorrectly implemented at Windows

2019-03-01 Thread Andrew Dunstan


On 2/25/19 10:38 AM, Konstantin Knizhnik wrote:
> Hi hackers,
>
> Small issue with the readdir implementation for Windows.
> Right now it returns ENOENT in case of any error returned by
> FindFirstFile.
> So all places in Postgres where opendir/readdir are used will assume
> that directory is empty and
> do nothing without reporting any error.
> It is not so good if directory is actually not empty but there are not
> enough permissions for accessing the directory and FindFirstFile
> returns ERROR_ACCESS_DENIED:
>
> struct dirent *
> readdir(DIR *d)
> {
>     WIN32_FIND_DATA fd;
>
>     if (d->handle == INVALID_HANDLE_VALUE)
>     {
>         d->handle = FindFirstFile(d->dirname, &fd);
>         if (d->handle == INVALID_HANDLE_VALUE)
>         {
>             errno = ENOENT;
>             return NULL;
>         }
>     }
>
>
> Attached please find small patch fixing the problem.
>

Diagnosis seems correct. I wonder if this is responsible for some odd
things we've seen over time on Windows.

This reads a bit oddly:


     {
-            errno = ENOENT;
+            if (GetLastError() == ERROR_FILE_NOT_FOUND)
+            {
+                /* No more files, force errno=0 (unlike mingw) */
+                errno = 0;
+                return NULL;
+            }
+            _dosmaperr(GetLastError());
         return NULL;
     }


Why not something like:


if (GetLastError() == ERROR_FILE_NOT_FOUND)
    errno = 0;
else
    _dosmaperr(GetLastError());
return NULL;


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: SQL/JSON: JSON_TABLE

2019-03-01 Thread Nikita Glukhov

On 01.03.2019 19:17, Robert Haas wrote:


On Thu, Feb 28, 2019 at 8:19 PM Nikita Glukhov  wrote:

Attached 34th version of the patches.

Kinda strange version numbering -- the last post on this thread is v21.


For simplicity of dependency tracking, the version numbering of the JSON_TABLE
patches matches the version numbering of the patches on which they depend --
jsonpath and SQL/JSON.  The corresponding jsonpath patch has version v34 now.

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: NOT IN subquery optimization

2019-03-01 Thread Tom Lane
Andres Freund  writes:
> On March 1, 2019 4:53:03 AM PST, David Rowley  
> wrote:
>> On Fri, 1 Mar 2019 at 15:27, Richard Guo  wrote:
>>> 1. The patch would give wrong results when the inner side is empty.
>>> 2. Because of the new added predicate 'OR (var is NULL)', we cannot
>>> use hash join or merge join to do the ANTI JOIN.

> I've not checked, but could we please make sure these cases are covered
> in the regression tests today with a single liner?

I'm not sure if the second one is actually a semantics bug or just a
misoptimization?  But yeah, +1 for putting in some simple tests for
corner cases right now.  Anyone want to propose a specific patch?

regards, tom lane



Re: pg_partition_tree crashes for a non-defined relation

2019-03-01 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Feb 28, 2019 at 11:50:16PM -0500, Tom Lane wrote:
>> But, having said that, we've learned that it's generally bad for
>> catalog-query functions to fail outright just because they're pointed
>> at the wrong kind of catalog object.  So I think that what we want here
>> is for pg_partition_tree to return NULL or an empty set or some such
>> for a plain table, while its output for a childless partitioned table
>> should be visibly different from that.  I'm not wedded to details
>> beyond that idea.

> Yep, that's the intention since cc53123.  I won't come back to return
> an ERROR in any case.  Here is what the patch gives for childless
> partitions FWIW:
> =# CREATE TABLE ptif_test (a int, b int) PARTITION BY range (a);
> CREATE TABLE
> =# SELECT * FROM pg_partition_tree('ptif_test');
>    relid   | parentrelid | isleaf | level 
> -----------+-------------+--------+-------
>  ptif_test | null        | f      |     0
> (1 row)

Right, while you'd get zero rows out for a non-partitioned table.
WFM.

regards, tom lane



Re: Prevent extension creation in temporary schemas

2019-03-01 Thread Tom Lane
Michael Paquier  writes:
> On Thu, Feb 28, 2019 at 10:52:52PM -0500, Tom Lane wrote:
>> If you're suggesting that we disable that security restriction
>> during extension creation, I really can't see how that'd be a
>> good thing ...

> No, I don't mean that.  I was just wondering if someone can set
> search_path within the SQL script which includes the extension
> contents to bypass the restriction and the error.  They can always
> prefix such objects with pg_temp anyway if need be...

You'd have to look in namespace.c to be sure, but I *think* that
we don't consult the temp schema during function/operator lookup
even if it's explicitly listed in search_path.

It might be possible for an extension script to get around this with
code like, say,

CREATE TRIGGER ... EXECUTE PROCEDURE @extschema@.myfunc();

although you'd have to give up relocatability of the extension
to use @extschema@.  (Maybe it was a bad idea to not provide
that symbol in relocatable extensions?  A usage like this doesn't
prevent the extension from being relocated later.)

regards, tom lane



Re: NOT IN subquery optimization

2019-03-01 Thread Andres Freund
Hi,

On March 1, 2019 4:53:03 AM PST, David Rowley  
wrote:
>On Fri, 1 Mar 2019 at 15:27, Richard Guo  wrote:
>> I have reviewed your patch. Good job except two issues I can find:
>>
>> 1. The patch would give wrong results when the inner side is empty.
>In this
>> case, the whole data from outer side should be in the outputs. But
>with the
>> patch, we will lose the NULLs from outer side.
>>
>> 2. Because of the new added predicate 'OR (var is NULL)', we cannot
>use hash
>> join or merge join to do the ANTI JOIN.  Nested loop becomes the only
>choice,
>> which is low-efficency.
>
>Yeah. Both of these seem pretty fundamental, so setting the patch to
>waiting on author.

I've not checked, but could we please make sure these cases are covered in the 
regression tests today with a single liner? Seems people had to rediscover them 
a number of times now, and unless this thread results in an integrated feature 
soonish, it seems likely other people will again.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: FETCH FIRST clause PERCENT option

2019-03-01 Thread Tomas Vondra


On 3/1/19 2:31 AM, Kyotaro HORIGUCHI wrote:
> Hello.
> 
> At Thu, 28 Feb 2019 21:16:25 +0100, Tomas Vondra 
>  wrote in 
> 
>>> One biggest issue seems to be we don't know the total number of
> 
> # One *of* the biggest *issues*?
> 
>>> outer tuples before actually reading a null tuple. I doubt of
>>> general shortcut for that. It also seems preventing limit node
>>> from just using materialized outer.
>>>
>>
>> Sure, if you actually want all tuples, you'll have to execute the outer
>> plan till completion. But that's not what I'm talking about - what if we
>> only ever need to read one row from the limit?
> 
> We have no choice but to read all the tuples once, just to find we
> are to return just one row, since the estimator is not guaranteed to
> be exact as required for this purpose.
> 

I don't follow - I'm not suggesting to base this on row estimates at
all. I'm saying that we can read in rows, and decide how many rows we
are expected to produce.

For example, with 1% sample, we'll produce about 1 row for each 100 rows
read from the outer plan. So we read the first row, and compute

  rows_to_produce = ceil(0.01 * rows_read)

and every time this increases, we emit another row from a tuplestore.
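
A standalone toy showing just that counting logic (in the executor the buffered
rows would of course come from the tuplestore; this is only the arithmetic):

#include <stdio.h>
#include <math.h>

int
main(void)
{
    const double pct = 1.0;             /* FETCH FIRST 1 PERCENT */
    long    rows_read = 0;
    long    rows_emitted = 0;
    int     c;

    while ((c = getchar()) != EOF)      /* stand-in for pulling outer rows */
    {
        if (c != '\n')
            continue;
        rows_read++;

        /* emit whenever the running target overtakes what we've emitted */
        while (rows_emitted < (long) ceil(pct / 100.0 * rows_read))
        {
            rows_emitted++;
            printf("emit buffered row %ld (after reading %ld)\n",
                   rows_emitted, rows_read);
        }
    }
    return 0;
}

Note that the very first input row already yields the first output row, which
is the point: we don't have to exhaust the outer plan before returning anything.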


>> To give you a (admittedly, somewhat contrived and artificial example):
>>
>> SELECT * FROM t1 WHERE id IN (
>>   SELECT id FROM t2 ORDER BY x FETCH FIRST 10 PERCENT ROWS ONLY
>> );
>>
>> Maybe this example is bogus and/or does not really matter in practice. I
>> don't know, but I've been unable to convince myself that's the case.
> 
> I see that kind of idiom commonly. Even in the quite simple example
> above, *we* cannot tell how many tuples the inner should return
> unless we actually fetch all tuples in t2. This is the same
> problem with count(*).
> 
> The query is equivalent to the folloing one.
> 
>  SELECT * FROM t1 WHERE id IN (
>SELECT id FROM t2 ORDER BY x
>  FETCH FIRST (SELECT ceil(count(*) * 0.1) FROM t2) ROWS ONLY
>  );
> 
> This scans t2 twice, but this patch does only one full scan
> moving another partial scan to tuplestore. We would win if the
> outer is complex enough.
> 

But what if all t1 rows match the very first row in the subselect? In
that case we don't need to read all rows from the subplan - we only need
to read the first chunk (until we emit the first row).

> Anyway, even excluding the count(*) issue, it seems that we are
> not alone in that point. (That is, at least Oracle shows
> different E-Rows and A-Rows for PERCENT).
> 

I'm not sure what E-Rows and A-Rows are. I suppose that's estimated and
actual - but as I explained above, that's not what I propose the
incremental approach to use.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: SQL/JSON: JSON_TABLE

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 8:19 PM Nikita Glukhov  wrote:
> Attached 34th version of the patches.

Kinda strange version numbering -- the last post on this thread is v21.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Looks heap_create_with_catalog ignored the if_not_exists options

2019-03-01 Thread Andy Fan
Thank you Michael!

What can I do if I'm sure I will not use the CTAS creation path?  Take a look
at the "heap_create_with_catalog" function: it checks for an existing relation
and raises an error.  Even if I change it to raise the error only when the
relation exists and IF NOT EXISTS was not given, it would still be problematic,
since some other session may create the relation between the check and the end
of the real creation.

It looks like we need some locking there, but since PG uses a process model, I
don't even know how to synchronize such code among processes in PG (any hint on
this would be very welcome as well).

On Fri, Mar 1, 2019 at 8:35 PM Michael Paquier  wrote:

> On Fri, Mar 01, 2019 at 07:17:04PM +0800, Andy Fan wrote:
> > for a createStmt,  it will call transformCreateStmt,  and then
> > heap_create_with_catalog.
> > but it looks like it just checks the if_not_exists in transformCreateStmt.
> >
> > is it designed as this on purpose or is it a bug?
>
> That's a bug.  Andreas Karlsson and I have been discussing it a couple
> of days ago actually:
> https://www.postgresql.org/message-id/20190215081451.gd2...@paquier.xyz
>
> Fixing this is not as straight-forward as it seems, as it requires
> shuffling a bit the code related to a CTAS creation so as all code
> paths check at the same time for an existing relation.  Based on my
> first impressions, I got the feeling that it would be rather invasive
> and not worth a back-patch.
> --
> Michael
>


Re: Problems with plan estimates in postgres_fdw

2019-03-01 Thread Antonin Houska
Etsuro Fujita  wrote:

> (2019/03/01 20:00), Antonin Houska wrote:
> > Etsuro Fujita  wrote:

> > I used gdb to help me understand, however the condition
> >
> > if (fpextra&&  !IS_UPPER_REL(foreignrel))
> >
> > never evaluated to true with the query above.
> 
> Sorry, my explanation was not enough again, but I showed that query ("SELECT
> a+b, random() FROM foreign_table GROUP BY a+b ORDER BY a+b;") to explain why
> the following code bit is needed:
> 
> +   /*
> +* If this includes an UPPERREL_ORDERED step, the given target, which
> +* would be the final target to be applied to the resulting path,
> might
> +* have different expressions from the underlying relation's reltarget
> +* (see make_sort_input_target()); adjust tlist eval costs.
> +*/
> +   if (fpextra&&  fpextra->target != foreignrel->reltarget)
> +   {
> +   QualCostoldcost = foreignrel->reltarget->cost;
> +   QualCostnewcost = fpextra->target->cost;
> +
> +   startup_cost += newcost.startup - oldcost.startup;
> +   total_cost += newcost.startup - oldcost.startup;
> +   total_cost += (newcost.per_tuple - oldcost.per_tuple) * rows;
> +   }

Maybe I understand now. Do the expressions (newcost.* - oldcost.*) reflect the
fact that, for the query

SELECT a+b, random() FROM foreign_table GROUP BY a+b ORDER BY a+b;

the UPPERREL_ORDERED stage only needs to evaluate the random() function
because (a + b) was already evaluated during the UPPERREL_GROUP_AGG stage?

-- 
Antonin Houska
https://www.cybertec-postgresql.com



Re: Question about commit 11cf92f6e2e13c0a6e3f98be3e629e6bd90b74d5

2019-03-01 Thread Robert Haas
On Fri, Mar 1, 2019 at 5:47 AM Etsuro Fujita
 wrote:
> Robert, I CCed you because you are the author of that commit.  Before
> that commit ("Rewrite the code that applies scan/join targets to
> paths."), apply_scanjoin_target_to_paths() had a boolean parameter named
> modify_in_place, and used apply_projection_to_path(), not
> create_projection_path(), to adjust scan/join paths when modify_in_place
> was true, which allowed us to save cycles at plan creation time by
> avoiding creating projection paths, which I think would be a good thing,
> but that commit removed that.  Why?

One of the goals of the commit was to properly account for the cost of
computing the target list.  Before parallel query and partition-wise
join, it didn't really matter what the cost of computing the target
list was, because every path was going to have to do the same work, so
it was just a constant factor getting added to every path.  However,
parallel query and partition-wise join mean that some paths can
compute the final target list more cheaply than others, and that turns
out to be important for things like PostGIS.  One of the complaints
that provoked that change was that PostGIS was picking non-parallel
plans even when a parallel plan was substantially superior, because it
wasn't accounting for the fact that in the parallel plan, the cost of
computing the target-list could be shared across all the workers
rather than paid entirely by the leader.

In order to accomplish this goal of properly accounting for the cost
of computing the target list, we need to create a new path, not just
jam the target list into an already-costed path.  Note that we did
some performance optimization around the same time to minimize the
performance hit here (see d7c19e62a8e0a634eb6b29f8fd944e57081f,
and I think there may have been something else as well although I
can't find it right now).

> The real reason for this question is: I noticed that projection paths
> put on foreign paths will make it hard for FDWs to detect whether there
> is an already-well-enough-sorted remote path in the pathlist for the
> final scan/join relation as an input relation to GetForeignUpperPaths()
> for the UPPERREL_ORDERED step (the IsA(path, ForeignPath) test would not
> work well enough to detect remote paths!), so I'm wondering whether we
> could revive that parameter like the attached, to avoid the overhead at
> plan creation time and to make the FDW work easy.  Maybe I'm missing
> something, though.

I think this would be a bad idea, for the reasons explained above.  I
also think that it's probably the wrong direction on principle.  I
think the way we account for target lists is still pretty crude and
needs to be thought of as more of a real planning step and less as
something that we can just ignore when it's inconvenient for some
reason.  I think the FDW just needs to look through the projection
path and see what's underneath, but also take the projection path's
target list into account when decide whether more can be pushed down.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: propagating replica identity to partitions

2019-03-01 Thread Robert Haas
On Thu, Feb 28, 2019 at 6:13 PM Alvaro Herrera  wrote:
> Tablespaces already behave a little bit especially in another sense:
> it doesn't recurse to indexes.  I think it's not possible to implement a
> three-way flag, where you tell it to move only the table, or to recurse
> to child tables but leave local indexes alone, or just to move
> everything.  If we don't move the child indexes, are we really recursing
> in that sense?

I don't quite follow this.  If you want to change the default for new
partitions, you would use ONLY, which updates the value for the parent
(which is cheap) but doesn't touch the children.  If you want to move
it all, you would omit ONLY.  I'm not sure I quite understand the third
case - what do you mean by moving the child tables but leaving the
local indexes alone?
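
To make the distinction concrete, a sketch of the semantics being discussed
(not a description of current behavior; the table name is made up):

    -- update only the parent's tablespace, i.e. the default for new partitions
    ALTER TABLE ONLY measurement SET TABLESPACE fast_ssd;

    -- recurse: also move the existing partitions (and, per the open question,
    -- possibly their local indexes)
    ALTER TABLE measurement SET TABLESPACE fast_ssd;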

> I think changing the behavior of tablespaces is likely to cause pain for
> people with existing scripts that expect the command not to recurse.  We
> would have to add a switch of some kind to provide the old behavior in
> order not to break those, and that doesn't seem so good either.
>
> I agree we definitely want to treat a partitioned table as similarly as
> possible to a non-partitioned table, but in the particular case of
> tablespace it seems to me better not to mess with the behavior.

You may be right, but I'm not 100% convinced.  I don't understand why
changing the behavior for tablespaces would be a big shocker and
changing the behavior for OWNER TO would be a big nothing.  If you
mess up the first one, you'll realize it's running for too long and go
"oh, oops" and fix it next time.  If you mess up the second one,
you'll have a security hole, be exploited by hackers, suffer civil and
criminal investigations due to a massive data security breach, and end
your life in penury and in prison, friendless and alone.  Or maybe
not, but the point is that BOTH of these things have an established
behavior such that changing it may surprise some people, and actually
I would argue that the surprise for the TABLESPACE will tend to be
more obvious and less likely to have subtle, unintended consequences.

I hate to open a can of worms here, but there's also the relationship
between OWNER TO and GRANT/REVOKE; that might also need a little
thought.

> CLUSTER: I'm not sure about this one.  For legacy inheritance there's
> no principled way to figure out the right index to recurse to.  For
> partitioning that seems a very simple problem, but I do wonder if
> there's any use case at all for ALTER TABLE CLUSTER there.  My
> inclination would be to reject ALTER TABLE CLUSTER on a partitioned
> table altogether, if it isn't already.

Hmm, I disagree.  I think this should work and set the value for the
whole hierarchy.  That seems well-defined and useful -- why is it any
less useful than for a standalone table?

> OWNER: let's just change this one to recurse always.  I think we have a
> consensus about this one.

We really don't have many votes either way, AFAICS.  I'm not direly
opposed, but I'd like to see a more vigorous attempt to make them ALL
recurse rather than changing one per release for the next 8 years.

> TRIGGER: I think it would be good to recurse, based on the trigger name.
> I'm not sure what to do about legacy inheritance; the trigger isn't
> likely to be there.  An option is to silently ignore each inheritance
> child (not partition) if the trigger isn't present on it.

Hmm, I think this one should maybe recurse to triggers that cascaded
downwards, which would of its nature apply to partitioning but not to
inheritance.

> We all seem to agree that REPLICA IDENTITY should recurse.

My feeling on this is the same as for OWNER.

> Maybe ADD GENERATED AS IDENTITY / DROP IDENTITY should recurse; not
> really sure about this one.  Peter?

I don't know enough about this to have an opinion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: partitioned tables referenced by FKs

2019-03-01 Thread Jesper Pedersen

Hi Alvaro,

On 2/28/19 1:28 PM, Alvaro Herrera wrote:

Rebased to current master.  I'll reply later to handle the other issues
you point out.



Applying with hunks.

With 0003 using

export CFLAGS="-DCOPY_PARSE_PLAN_TREES -O0 -fno-omit-frame-pointer" && 
CC="ccache gcc" ./configure --prefix /usr/local/packages/postgresql-12.x 
--enable-dtrace --with-openssl --with-gssapi --with-libxml --with-llvm 
--enable-debug --enable-depend --enable-tap-tests --enable-cassert


I'm getting a failure in the pg_upgrade test:

 --
+-- Name: pk5 pk5_pkey; Type: CONSTRAINT; Schema: regress_fk; Owner: 
jpedersen

+--
+
+ALTER TABLE ONLY regress_fk.pk5
+ADD CONSTRAINT pk5_pkey PRIMARY KEY (a);
+
+
+--

when running check-world.

Best regards,
 Jesper



Re: pg_dump/pg_restore fail for TAR_DUMP and CUSTOM_DUMP from v94/v95/v96 to v11/master.

2019-03-01 Thread Tom Lane
Suraj Kharage  writes:
> The Commit 5955d934194c3888f30318209ade71b53d29777f has changed the logic
> to avoid dumping creation and comment commands for the public schema.

Yup.

> As reported by Prabhat, if we try to restore the custom/tar dump taken from
> v10 and earlier versions, we get the reported error for public schema.

Yes.  We're not intending to do anything about that.  The previous scheme
also caused pointless errors in some situations, so this isn't really a
regression.  The area is messy enough already that trying to avoid errors
even with old (wrong) archives would almost certainly cause more problems
than it solves.  In particular, it's *not* easy to fix things in a way
that works conveniently for both superuser and non-superuser restores.
See the mail thread referenced by 5955d9341.

(Note that it's only been very recently that anyone had any expectation
that pg_dump scripts could be restored with zero errors in all cases;
the usual advice was just to ignore noncritical errors.  I'm not that
excited about it if the old advice is still needed when dealing with old
archives.)

regards, tom lane



Re: PostgreSQL vs SQL/XML Standards

2019-03-01 Thread Ramanarayana
Hi,
Yes it is working fine with \a option in psql.

Cheers
Ram 4.0


Re: PostgreSQL vs SQL/XML Standards

2019-03-01 Thread Chapman Flack
On 03/01/19 07:15, Ramanarayana wrote:
> Hi,
> I have tested bug fixes provided by all the patches. They are working
> great. I found one minor issue
> 
>  select * from xmltable('*' PASSING 'pre arg?>&deeppost' COLUMNS x XML PATH '/');
> 
> The above query returns the xml. But there is an extra plus symbol at the
> end
> 
> pre&deeppost+

Hi,

Are you sure that isn't the + added by psql when displaying a value
that contains a newline?

What happens if you repeat the query but after the psql command

  \a

to leave the output unaligned?
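
For reference, aligned output marks embedded newlines with a trailing "+",
which is probably what is being seen here; a minimal illustration with a
made-up value:

    SELECT E'line1\nline2' AS v;
       v
    -------
     line1+
     line2
    (1 row)

    \a
    SELECT E'line1\nline2' AS v;
    v
    line1
    line2
    (1 row)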

Thanks!
-Chap



Re: Add exclusive backup deprecation notes to documentation

2019-03-01 Thread Laurenz Albe
Magnus Hagander wrote:
> Maybe have the first note say "This method is deprecated because it has
> serious risks (see bellow)" and then list the actual risks at the end? 

Good idea.  That may attract the attention of the dogs among the readers.

Yours,
Laurenz Albe




Re: pgsql: Build src/port files as a library with -fPIC, and use that in li

2019-03-01 Thread Christoph Berg
Re: Tom Lane 2019-01-31 <12792.1548965...@sss.pgh.pa.us>
> initdb hasn't depended on those libpq exports since 8.2, and it was
> a bug that it did so even then.

9.2 psql (and createuser, ...) is also broken:

/usr/lib/postgresql/9.2/bin/psql: symbol lookup error: 
/usr/lib/postgresql/9.2/bin/psql: undefined symbol: pqsignal

Could we at least put a compat symbol back?

Christoph



Re: Add exclusive backup deprecation notes to documentation

2019-03-01 Thread Magnus Hagander
On Fri, Mar 1, 2019 at 2:07 PM David Steele  wrote:

> On 3/1/19 1:13 PM, Peter Eisentraut wrote:
> > Please follow the 1-space indentation in the documentation files.
>
> Whoops.  Will fix.
>
> > I think the style of mentioning all the problems in a note before the
> > actual description is a bit backwards.
>
> In the case of an important note like this I think it should be right at
> the top where people will see it.  Not everyone reads to the end
>

Maybe have the first note say "This method is deprecated because it has
serious risks (see bellow)" and then list the actual risks at the end?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Add exclusive backup deprecation notes to documentation

2019-03-01 Thread David Steele

On 3/1/19 1:13 PM, Peter Eisentraut wrote:

Please follow the 1-space indentation in the documentation files.


Whoops.  Will fix.


I think the style of mentioning all the problems in a note before the
actual description is a bit backwards.


In the case of an important note like this I think it should be right at 
the top where people will see it.  Not everyone reads to the end.


--
-David
da...@pgmasters.net



Re: speeding up planning with partitions

2019-03-01 Thread Amit Langote
On 2019/03/01 22:01, Amit Langote wrote:
> Of course, I had to make sure that query_planner (which is not
> in the charge of adding source inheritance child objects) can notice that.

Oops, I meant to write "query_planner (which *is* in the charge of adding
source inheritance child objects)..."

Thanks,
Amit




Re: speeding up planning with partitions

2019-03-01 Thread Amit Langote
Imai-san,

Thanks for testing and sorry it took me a while to reply.

On 2019/02/25 15:24, Imai, Yoshikazu wrote:
> [update_pt_with_joining_another_pt.sql]
> update rt set c = jrt.c + 100 from jrt where rt.b = jrt.b;
> 
> [pgbench]
> pgbench -n -f update_pt_with_joining_another_pt_for_ptkey.sql -T 60 postgres
> 
> [results]
> (part_num_rt, part_num_jrt)  master  patched(0001)
> ---  --  -
>   (3, 1024)8.06   5.89
>   (3, 2048)1.52   0.87
>   (6, 1024)4.11   1.77
> 
> 
> 
> With HEAD, we add the target inheritance and source inheritance to parse->rtable
> in inheritance_planner, and copy and adjust it for child tables at the beginning
> of each child table's planning.
> 
> With the 0001 patch, we add the target inheritance to parse->rtable in
> inheritance_planner and add the source inheritance to parse->rtable in
> make_one_rel() (under grouping_planner()) during each child table's planning.
> Adding the source inheritance to parse->rtable is likely the same work repeated
> for each child table's planning, so it might be wasted effort. OTOH, from the
> POV of making inheritance-expansion-at-bottom better, expanding the source
> inheritance in make_one_rel() seems like the correct design to me.
> 
> How should we do that...?

To solve this problem, I ended up teaching inheritance_planner to reuse
the objects for *source* inheritance child relations (RTEs,
AppendRelInfos, and PlanRowMarks) created during the planning of the 1st
child query for the planning of subsequent child queries.  Set of source
child relations don't change between different planning runs, so it's okay
to do so.  Of course, I had to make sure that query_planner (which is not
in the charge of adding source inheritance child objects) can notice that.

Please find attached updated patches.  Will update source code comments,
commit message and perform other fine-tuning over the weekend if possible.

Thanks,
Amit
From 2f4c085145c3991c8d7160eab0c90d5d3f9d713e Mon Sep 17 00:00:00 2001
From: amit 
Date: Fri, 26 Oct 2018 16:45:59 +0900
Subject: [PATCH v25 1/2] Lazy creation of RTEs for inheritance children

Rewrite inherit.c so that source inheritance could be expanded
during query_planner().  That's better because for partitioned
tables, we can add only the partitions left after performing
partition pruning.

Since it's not known what RowMark identities are necessary for
individual inheritance children until they're added to the query,
preprocess_targetlist must delay adding junk columns if there are
any parent PlanRowMarks.

For partitioning, although we don't create a RangeTblEntry and
RelOptInfo for pruned partitions at make_one_rel time, partitionwise
join code relies on the fact that even though partitions may have
been pruned, they'd still own a RelOptInfo to handle the outer join
case where the pruned partition appears on the nullable side of join.
Partitionwise join code deals with that by allocating dummy
RelOptInfos for pruned partitions that are based mostly on their
parent's properties.

There are a few regression test diffs:

 1. Caused by the fact that we no longer allocate a duplicate RT
entry for a partitioned table in its role as child, as seen in
the partition_aggregate.out test output.

 2. Those in postgres_fdw.out are caused by the fact that junk columns
required for row marking are added to reltarget->exprs later than
user columns, because the row marking junk columns aren't added
until the inheritance is expanded which as of this commit is
later than it used to be as noted above.
---
 contrib/postgres_fdw/expected/postgres_fdw.out|  24 +-
 src/backend/optimizer/path/allpaths.c | 202 +--
 src/backend/optimizer/path/joinrels.c |  92 +++-
 src/backend/optimizer/plan/initsplan.c|  53 +-
 src/backend/optimizer/plan/planmain.c |  15 +-
 src/backend/optimizer/plan/planner.c  | 271 +++---
 src/backend/optimizer/plan/setrefs.c  |   6 +
 src/backend/optimizer/prep/preptlist.c| 130 +++--
 src/backend/optimizer/util/inherit.c  | 620 +-
 src/backend/optimizer/util/plancat.c  |  18 +-
 src/backend/optimizer/util/relnode.c  | 359 +++--
 src/backend/partitioning/partprune.c  |  44 +-
 src/include/nodes/pathnodes.h |   7 +
 src/include/optimizer/inherit.h   |   5 +-
 src/include/optimizer/pathnode.h  |   3 +
 src/include/optimizer/plancat.h   |   4 +-
 src/include/optimizer/planmain.h  |   1 +
 src/include/optimizer/prep.h  |   2 +
 src/test/regress/expected/partition_aggregate.out |   4 +-
 19 files changed, 1185 insertions(+), 675 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out 
b/contrib/postgres_fdw/expected/postgres_fdw.out
index

Re: SQL statement PREPARE does not work in ECPG

2019-03-01 Thread Michael Meskes
Hi Matsumura-san,

> I must use a midrule action like the following that works as
> expected.
> I wonder how to write the replacement to ecpg.addons.
> I think it's impossible, right? Please give me some advice.

You are right, for this change you have to replace the whole rule. This
cannot be done with an addon. Please see the attached for a way to do
this, untested though.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL
diff --git a/src/interfaces/ecpg/preproc/ecpg.trailer b/src/interfaces/ecpg/preproc/ecpg.trailer
index 1d58c778d9..be6c9d0ae9 100644
--- a/src/interfaces/ecpg/preproc/ecpg.trailer
+++ b/src/interfaces/ecpg/preproc/ecpg.trailer
@@ -1,5 +1,26 @@
 /* src/interfaces/ecpg/preproc/ecpg.trailer */
 
+PrepareStmt: PREPARE prepared_name prep_type_clause AS
+{
+prepared_name = $2;
+prepare_type_clause = $3;
+is_in_preparable_stmt = true;
+}
+PreparableStmt
+{
+$$.name = prepared_name;
+$$.type = prepare_type_clause;
+$$.stmt = $6;
+is_in_preparable_stmt = false;
+}
+| PREPARE prepared_name FROM execstring
+	{
+$$.name = $2;
+$$.type = NULL;
+$$.stmt = $4;
+}
+	;
+
 statements: /*EMPTY*/
 | statements statement
 		;
diff --git a/src/interfaces/ecpg/preproc/ecpg.type b/src/interfaces/ecpg/preproc/ecpg.type
index 519b737dde..1c68095274 100644
--- a/src/interfaces/ecpg/preproc/ecpg.type
+++ b/src/interfaces/ecpg/preproc/ecpg.type
@@ -145,3 +145,5 @@
 %type var_type
 
 %type   action
+
+%type  PrepareStmt
diff --git a/src/interfaces/ecpg/preproc/parse.pl b/src/interfaces/ecpg/preproc/parse.pl
index 6f67a5e942..cafca22f7a 100644
--- a/src/interfaces/ecpg/preproc/parse.pl
+++ b/src/interfaces/ecpg/preproc/parse.pl
@@ -64,6 +64,7 @@ my %replace_types = (
 	'stmtblock'  => 'ignore',
 	'stmtmulti'  => 'ignore',
 	'CreateAsStmt'   => 'ignore',
+	'PrepareStmt'	 => 'ignore',
 	'DeallocateStmt' => 'ignore',
 	'ColId'  => 'ignore',
 	'type_function_name' => 'ignore',


Re: NOT IN subquery optimization

2019-03-01 Thread David Rowley
On Fri, 1 Mar 2019 at 15:27, Richard Guo  wrote:
> I have reviewed your patch. Good job except two issues I can find:
>
> 1. The patch would give wrong results when the inner side is empty. In this
> case, all rows from the outer side should be in the output. But with the
> patch, we will lose the NULLs from the outer side.
>
> 2. Because of the newly added predicate 'OR (var is NULL)', we cannot use hash
> join or merge join to do the ANTI JOIN.  Nested loop becomes the only choice,
> which is inefficient.

Yeah. Both of these seem pretty fundamental, so setting the patch to
waiting on author.
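
For reference, the empty-inner-side semantics (issue 1) in a minimal form,
using made-up test tables:

    CREATE TABLE outer_t (a int);
    INSERT INTO outer_t VALUES (1), (NULL);
    CREATE TABLE inner_t (b int);

    -- inner side empty: every outer row qualifies, including the NULL
    SELECT * FROM outer_t WHERE a NOT IN (SELECT b FROM inner_t);

    INSERT INTO inner_t VALUES (2);

    -- inner side non-empty: the NULL outer row no longer qualifies,
    -- because the comparison result is unknown
    SELECT * FROM outer_t WHERE a NOT IN (SELECT b FROM inner_t);

Any rewrite to an anti-join has to preserve the first case as well, which is
where the proposed transformation currently loses the outer NULLs.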

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: FETCH FIRST clause PERCENT option

2019-03-01 Thread Surafel Temesgen
On Fri, Mar 1, 2019 at 4:33 AM Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> Hello.
>
> At Thu, 28 Feb 2019 21:16:25 +0100, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote in <
> fbd08ad3-5dd8-3169-6cba-38d610d7b...@2ndquadrant.com>
> > > One biggest issue seems to be we don't know the total number of
>
> # One *of* the biggest *issues*?
>
> > > outer tuples before actually reading a null tuple. I doubt of
> > > general shortcut for that. It also seems preventing limit node
> > > from just using materialized outer.
> > >
> >
> > Sure, if you actually want all tuples, you'll have to execute the outer
> > plan till completion. But that's not what I'm talking about - what if we
> > only ever need to read one row from the limit?
>
> We have no choice but to read all the tuples once, just to find that we
> are to return only one row, since the estimator is not guaranteed to
> be as exact as required for this purpose.
>
> > To give you a (admittedly, somewhat contrived and artificial example):
> >
> > SELECT * FROM t1 WHERE id IN (
> >   SELECT id FROM t2 ORDER BY x FETCH FIRST 10 PERCENT ROWS ONLY
> > );
> >
> > Maybe this example is bogus and/or does not really matter in practice. I
> > don't know, but I've been unable to convince myself that's the case.
>
> I see that kind of idiom commonly. Even in the quite simple example
> above, *we* cannot tell how many tuples the inner should return
> unless we actually fetch all tuples in t2. This is the same
> problem as with count(*).
>
> The query is equivalent to the following one.
>
>  SELECT * FROM t1 WHERE id IN (
>SELECT id FROM t2 ORDER BY x
>  FETCH FIRST (SELECT ceil(count(*) * 0.1) FROM t2) ROWS ONLY
>  );
>
> This scans t2 twice, but this patch does only one full scan
> moving another partial scan to tuplestore. We would win if the
> outer is complex enough.
>

Okay, here is the previous implementation with the upthread review comments
included; it also considers the OFFSET clause in the percentage calculation.
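
For reference, a hypothetical usage example of the proposed syntax (the table
name is made up; how the OFFSET rows figure into the percentage is exactly
what this revision is meant to address):

    SELECT * FROM t2 ORDER BY x
    OFFSET 20 ROWS
    FETCH FIRST 10 PERCENT ROWS ONLY;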

regards
Surafel
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 06d611b64c..e3ce4d7e36 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -44,7 +44,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionexpression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
 [ LIMIT { count | ALL } ]
 [ OFFSET start [ ROW | ROWS ] ]
-[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
+[ FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY ]
 [ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ]
 
 where from_item can be one of:
@@ -1430,7 +1430,7 @@ OFFSET start
 which PostgreSQL also supports.  It is:
 
 OFFSET start { ROW | ROWS }
-FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY
+FETCH { FIRST | NEXT } [ count ] [ PERCENT ] { ROW | ROWS } ONLY
 
 In this syntax, the start
 or count value is required by
@@ -1440,7 +1440,8 @@ FETCH { FIRST | NEXT } [ count ] {
 ambiguity.
 If count is
 omitted in a FETCH clause, it defaults to 1.
-ROW
+with PERCENT count specifies the maximum number of rows to return 
+in percentage.ROW
 and ROWS as well as FIRST
 and NEXT are noise words that don't influence
 the effects of these clauses.
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index baa669abe8..2be6f655fb 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -43,6 +43,7 @@ ExecLimit(PlanState *pstate)
 	LimitState *node = castNode(LimitState, pstate);
 	ScanDirection direction;
 	TupleTableSlot *slot;
+	TupleDesc   tupleDescriptor;
 	PlanState  *outerPlan;
 
 	CHECK_FOR_INTERRUPTS();
@@ -52,6 +53,8 @@ ExecLimit(PlanState *pstate)
 	 */
 	direction = node->ps.state->es_direction;
 	outerPlan = outerPlanState(node);
+	slot = node->subSlot;
+	tupleDescriptor = node->ps.ps_ResultTupleDesc;
 
 	/*
 	 * The main logic is a simple state machine.
@@ -60,6 +63,26 @@ ExecLimit(PlanState *pstate)
 	{
 		case LIMIT_INITIAL:
 
+			if (node->limitOption == PERCENTAGE)
+			{
+
+/*
+ * In PERCENTAGE option the number of tuple to return can't
+ * be determine before receiving the last tuple. so execute
+ * outerPlan until the end and store the result tuple to avoid
+ * executing twice
+ */
+for (;;)
+{
+	slot = ExecProcNode(outerPlan);
+	if (TupIsNull(slot))
+	{
+		break;
+	}
+	tuplestore_puttupleslot(node->tuple_store, slot);
+}
+			}
+
 			/*
 			 * First call for this node, so compute limit/offset. (We can't do
 			 * this any earlier, because parameters from upper nodes will not
@@ -87,24 +110,47 @@ ExecLimit(PlanState *pstate)
 return NULL;
 			}
 
-			/*
-			 * Fetch rows from subplan until we reach position > offset.
-			 */
-			for (;;)
+			if (node->limitOption == PERCENTAGE)
 			{
-slot = ExecProcNode(outerPlan);
-	

Re: Problems with plan estimates in postgres_fdw

2019-03-01 Thread Etsuro Fujita

(2019/03/01 20:00), Antonin Houska wrote:

Etsuro Fujita  wrote:

(2019/02/22 22:54), Antonin Houska wrote:

Etsuro Fujita   wrote:

So, the two changes are handling different
cases, hence both changes would be required.



+   /*
+* If this is an UPPERREL_ORDERED step performed on the final
+* scan/join relation, the costs obtained from the cache wouldn't yet
+* contain the eval costs for the final scan/join target, which would
+* have been updated by apply_scanjoin_target_to_paths(); add the eval
+* costs now.
+*/
+   if (fpextra && !IS_UPPER_REL(foreignrel))
+   {
+   /* The costs should have been obtained from the cache. */
+   Assert(fpinfo->rel_startup_cost >= 0 &&
+  fpinfo->rel_total_cost >= 0);
+
+   startup_cost += foreignrel->reltarget->cost.startup;
+   run_cost += foreignrel->reltarget->cost.per_tuple * rows;
+   }



Yeah, but I think the code bit cited above is needed.  Let me explain using
yet another example with grouping and ordering:

 SELECT a+b, random() FROM foreign_table GROUP BY a+b ORDER BY a+b;

For this query, the sort_input_target would be {a+b}, (to which the
foreignrel->reltarget would have been set when called from
estimate_path_cost_size() called from postgresGetForeignUpperPaths() with the
UPPERREL_ORDERED stage), but the final target would be {a+b, random()}, so the
sort_input_target isn't the same as the final target in that case; the costs
need to be adjusted so that they include the cost of generating the
final target.  That's the reason why I added the code bit you cited.


I used gdb to help me understand, however the condition

if (fpextra && !IS_UPPER_REL(foreignrel))

never evaluated to true with the query above.


Sorry, my explanation was not enough again, but I showed that query 
("SELECT a+b, random() FROM foreign_table GROUP BY a+b ORDER BY a+b;") 
to explain why the following code bit is needed:


+   /*
+* If this includes an UPPERREL_ORDERED step, the given target, which
+* would be the final target to be applied to the resulting path, might
+* have different expressions from the underlying relation's reltarget
+* (see make_sort_input_target()); adjust tlist eval costs.
+*/
+   if (fpextra && fpextra->target != foreignrel->reltarget)
+   {
+   QualCost    oldcost = foreignrel->reltarget->cost;
+   QualCost    newcost = fpextra->target->cost;
+
+   startup_cost += newcost.startup - oldcost.startup;
+   total_cost += newcost.startup - oldcost.startup;
+   total_cost += (newcost.per_tuple - oldcost.per_tuple) * rows;

+   }

I think that the condition

if (fpextra && fpextra->target != foreignrel->reltarget)

would be evaluated to true for that query.


It's still unclear to me why add_foreign_ordered_paths() passes the input
relation (input_rel) to estimate_path_cost_size(). If it passed the output rel
(i.e. ordered_rel in this case) like add_foreign_grouping_paths() does, then
foreignrel->reltarget should contain the random() function. (What I see now is
that create_ordered_paths() leaves ordered_rel->reltarget empty, but that's
another problem to be fixed.)

Do you still see a reason to call estimate_path_cost_size() this way?


Yeah, the reason for that is that we can estimate the costs of 
sorting for the final scan/join relation using the existing code as-is 
that estimates the costs of sorting for base or join relations, except 
for tlist eval cost adjustment.  As you mentioned, we could pass 
ordered_rel to estimate_path_cost_size(), but if so, I think we would 
need to get input_rel (ie, the final scan/join relation) from 
ordered_rel within estimate_path_cost_size(), to use that code, which 
would not be great.


Best regards,
Etsuro Fujita




Re: Looks heap_create_with_catalog ignored the if_not_exists options

2019-03-01 Thread Michael Paquier
On Fri, Mar 01, 2019 at 07:17:04PM +0800, Andy Fan wrote:
> For a CreateStmt, it will call transformCreateStmt and then
> heap_create_with_catalog,
> but it looks like it only checks if_not_exists in transformCreateStmt.
> 
> Is it designed this way on purpose, or is it a bug?

That's a bug.  Andreas Karlsson and I have been discussing it a couple
of days ago actually:
https://www.postgresql.org/message-id/20190215081451.gd2...@paquier.xyz

Fixing this is not as straight-forward as it seems, as it requires
shuffling a bit the code related to a CTAS creation so as all code
paths check at the same time for an existing relation.  Based on my
first impressions, I got the feeling that it would be rather invasive
and not worth a back-patch.
--
Michael




Re: using index or check in ALTER TABLE SET NOT NULL

2019-03-01 Thread David Rowley
On Sun, 15 Apr 2018 at 19:09, Sergei Kornilov  wrote:
> Attached updated patch follows recent Reorganize partitioning code commit.

I've made a pass over v10. I think it's in pretty good shape, but I
did end up changing a few small things.

1. Tweaked the documents a bit. I was going to just change "Full table
scan" to "A full table scan", but I ended up rewriting the entire
paragraph.
2. In ATRewriteTables, updated comment to mention "If required" since
the scan may not be required if SET NOT NULLs are already proved okay.
3. Updated another forgotten comment in ATRewriteTables.
4. Removed the mention of tables with OIDs in ATExecAddColumn(). This
seems left over from 578b229718.
5. Changed ATExecSetNotNull not to check for matching constraints if
we're already going to make a pass over the table. Setting
verify_new_notnull to true again does not make it any more true. :)
6. Tweaked NotNullImpliedByRelConstraints() a bit. You were calling
lappend on an empty list. list_make1() is fine. Don't need a variable
for that, just pass it to the function.
7. Tightened up the comments in ConstraintImpliedByRelConstraint() to
mention that it only supports immutable quals containing vars with
varno=1. Removed the comment near the bottom that mentioned that in
passing.

The only thing that I'm a bit unsure of is the tests. I've read the
thread and I see the discussion above about it.  I'd personally have
thought INFO was fine since ATTACH PARTITION does that, but I see
objections.  It appears that all the tests just assume that the CHECK
constraint was used to validate the SET NOT NULL. I'm not all that
sure if there's any good reason not to set client_min_messages =
'debug1' just before your test then RESET client_min_messages;
afterwards.  No other tests seem to do it, but my only thoughts on the
reason not to would be that it might fail if someone added another
debug somewhere, but so what? they should update the expected results
too.
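
For example, such a test might look like this (assuming the patch reports at
DEBUG1 when an existing constraint lets it skip the full-table scan; the table
name is made up):

    CREATE TABLE notnull_con (a int CHECK (a > 0));
    SET client_min_messages = 'debug1';
    ALTER TABLE notnull_con ALTER COLUMN a SET NOT NULL;
    RESET client_min_messages;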

Oh, one other thing I don't really like is that
ConstraintImpliedByRelConstraint() adds all of the other column's IS
NOT NULL constraints to the existConstraint. This is always wasted
effort for the SET NOT NULL use case.  I wonder if a new flag should
be added to control this, or even better, separate that part out and
put back the function named PartConstraintImpliedByRelConstraint and
have it do the IS NOT NULL building before calling
ConstraintImpliedByRelConstraint(). No duplicate code that way and you
also don't need to touch partbound.c

Setting this on waiting on author.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


alter_table_set_not_null_by_constraints_only_v10_fixes.patch
Description: Binary data


RE: SQL statement PREPARE does not work in ECPG

2019-03-01 Thread Matsumura, Ryo
Hi Meskes-san

I must use a midrule action like the following that works as expected.
I wonder how to write the replacement to ecpg.addons.
I think it's impossible, right? Please give me some advice.

 PrepareStmt:
PREPARE prepared_name prep_type_clause AS
{
prepared_name = $2;
prepare_type_clause = $3;
is_in_preparable_stmt = true;
}
PreparableStmt
{
$$.name = prepared_name;
$$.type = prepare_type_clause;
$$.stmt = $6;
is_in_preparable_stmt = false;
}
| PREPARE prepared_name FROM execstring

Regards
Ryo Matsumura

> -Original Message-
> From: Michael Meskes [mailto:mes...@postgresql.org]
> Sent: Friday, March 1, 2019 8:42 PM
> To: Matsumura, Ryo/松村 量 ; Takahashi,
> Ryohei/高橋 良平 ;
> 'pgsql-hack...@postgresql.org' 
> Subject: Re: SQL statement PREPARE does not work in ECPG
> 
> Hi Matsumura-san,
> 
> > I will read README.parser, ecpg.addons, and *.pl to understand.
> 
> Feel free to ask, when anything comes up.
> 
> Michael
> --
> Michael Meskes
> Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
> Meskes at (Debian|Postgresql) dot Org
> Jabber: michael at xmpp dot meskes dot org
> VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL
> 



Re: House style for DocBook documentation?

2019-03-01 Thread Ramanarayana
Hi,
I have tested bug fixes provided by all the patches. They are working
great. I found one minor issue

 select * from xmltable('*' PASSING 'pre&deeppost' COLUMNS x XML PATH '/');

The above query returns the xml. But there is an extra plus symbol at the
end

pre&deeppost+

Regards,
Ram


Re: SQL statement PREPARE does not work in ECPG

2019-03-01 Thread Michael Meskes
Hi Matsumura-san,

> I will read README.parser, ecpg.addons, and *.pl to understand.

Feel free to ask, when anything comes up.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL




pg_background and BGWH_STOPPED

2019-03-01 Thread Švorc Martin
Hello,

We probably identified a bug in the pg_background implementation: 
https://github.com/vibhorkum/pg_background 
It is a race condition when starting the process and BGWH_STOPPED is returned - 
see the pull request for more info: 
https://github.com/RekGRpth/pg_background/pull/1 

I think I have created a fix (in one of the forks of the original repo, 
https://github.com/RekGRpth/pg_background, which already addresses some 
compilation issues), but then again I am not very familiar with the PG API and
would very much appreciate it if anyone could review the bug and approve the
solution.

Regards

Martin

-Original Message-
From: pgsql-hackers-ow...@postgresql.org 
[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of amul sul
Sent: Thursday, November 24, 2016 4:47 AM
To: PostgreSQL-development
Subject: pg_background contrib module proposal

Hi All,

I would like to take over the pg_background patch and repost it for discussion
and review.

Robert Haas initially shared this as a parallelism demonstration[1] and later
abandoned it with a summary of open issues[2] in this pg_background patch that
need to be fixed; most of them seem to have been addressed in core, except for
the handling of types without binary send/recv functions and documentation.
I have added handling for types that don't have binary send/recv functions in
the attached patch and will work on documentation at the end.

One concern with this patch is code duplication with exec_simple_query(). We
could consider Jim Nasby's patch[3] to overcome this, but we will certainly
end up complicating exec_simple_query() to make pg_background happy.

As discussed previously[1], pg_background is a contrib module that lets you
launch arbitrary commands in a background worker.

• VACUUM in the background
• Autonomous transaction implementation, better than the dblink way (i.e.
no separate authentication required).
• Allows performing tasks like CREATE INDEX CONCURRENTLY from a procedural
language.

This module comes with following SQL APIs:

• pg_background_launch : This API takes the SQL command that the user wants to
execute and the size of the queue buffer.
  This function returns the process id of the background worker.
• pg_background_result   : This API takes the process id as an input
parameter and returns the result of the command
  executed through the background worker.
• pg_background_detach : This API takes the process id and detaches the
background worker that is waiting for the user to read its results.


Here's an example of running vacuum and then fetching the results.
Notice that the
notices from the original session are propagated to our session; if an error 
had occurred, it would be re-thrown locally when we try to read the results.

postgres=# create table foo (a int);
CREATE TABLE
postgres=# insert into foo values(generate_series(1,5)); INSERT 0 5

postgres=# select pg_background_launch('vacuum verbose foo'); 
pg_background_launch
--
  65427
(1 row)

postgres=# select * from pg_background_result(65427) as (x text);
INFO:  vacuuming "public.foo"
INFO:  "foo": found 0 removable, 5 nonremovable row versions in 1 out of 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
 x

VACUUM
(1 row)


Thanks to Vibhor kumar, Rushabh Lathia and Robert Haas for feedback.

Please let me know your thoughts, and thanks for reading.

[1]. 
https://www.postgresql.org/message-id/CA%2BTgmoam66dTzCP8N2cRcS6S6dBMFX%2BJMba%2BmDf68H%3DKAkNjPQ%40mail.gmail.com
[2]. 
https://www.postgresql.org/message-id/CA%2BTgmobPiT_3Qgjeh3_v%2B8Cq2nMczkPyAYernF_7_W9a-6T1PA%40mail.gmail.com
[3]. https://www.postgresql.org/message-id/54541779.1010906%40BlueTreble.com

Regards,
Amul


Looks heap_create_with_catalog ignored the if_not_exists options

2019-03-01 Thread Andy Fan
For a CreateStmt, it will call transformCreateStmt and then
heap_create_with_catalog,
but it looks like it only checks if_not_exists in transformCreateStmt.

So there is a chance that when transformCreateStmt is called the table does
not exist yet, but by the time heap_create_with_catalog is called, the table
has been created.  If so, "IF NOT EXISTS" will still raise the error "ERROR:
relation "" already exists"
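
In two-session terms the race looks like this (hypothetical table name;
session 1 is paused, e.g. in a debugger, between the existence check in
transformCreateStmt and the catalog insertion):

    -- session 1
    CREATE TABLE IF NOT EXISTS t_race (a int);   -- passes the check, then stalls
    -- session 2, while session 1 is stalled
    CREATE TABLE t_race (a int);
    -- session 1 resumes and fails with
    --   ERROR:  relation "t_race" already exists
    -- instead of the NOTICE that IF NOT EXISTS is supposed to give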

I can reproduce this with gdb,

demo=# create table if not exists 2 (a int);
ERROR:  relation "2" already exists

Is it designed this way on purpose, or is it a bug?

I am using the latest commit on GitHub now.


Re: pg_dump/pg_restore fail for TAR_DUMP and CUSTOM_DUMP from v94/v95/v96 to v11/master.

2019-03-01 Thread Suraj Kharage
Hi,

The Commit 5955d934194c3888f30318209ade71b53d29777f has changed the logic
to avoid dumping creation and comment commands for the public schema.
From v11 onwards, we are using the DUMP_COMPONENT_ infrastructure in
selectDumpableNamespace() to skip the public schema creation.

As reported by Prabhat, if we try to restore a custom/tar dump taken from
v10 and earlier versions, we get the reported error for the public schema.
The reason for this error is that a custom/tar dump taken from v10 and
earlier versions contains a "CREATE SCHEMA public;" statement, and v11 fails
to bypass it under the current logic.

The plain format does not produce the error in this case, because in all
versions pg_dump in plain format does not generate that "CREATE SCHEMA
public"; in v10 and earlier, we filter out that public schema creation in
_printTocEntry() during pg_dump.

In custom/tar format, pg_dump in v10 and earlier versions generates the
schema creation statement for the public schema, but when restoring with
pg_restore on the same or older branches, it gets skipped by the same
_printTocEntry() function.

I think we can add logic in either:
1) BuildArchiveDependencies(), to avoid dumping creation and comment
commands for the public schema, since we do not have the DUMP_COMPONENT_
infrastructure in all supported back-branches;
or
2) dumpNamespace(), to not include the public schema creation.

Thoughts?

Regards,
Suraj


  1   2   >