Re: [PATCH] Implement uuid_version()

2019-07-12 Thread Fabien COELHO



Hello Jose,

Got it, and done. Please find attached a v2 patch with the upgrade script 
included.


Patch v2 applies cleanly. Compiles cleanly (once running configure 
--with-uuid=...). Local make check ok. Doc build ok.


There are no tests; I'd suggest adding some under sql/ and changing expected/ if 
possible, which would return all possible values, including a call to 
pg_random_uuid() which should return 4.


Documentation describes uuid_version(), should it not describe 
uuid_version(namespace uuid)?


I'd suggest adding an example.
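
Perhaps something along these lines (the uuid-ossp generator functions are only 
assumed here for illustration):

SELECT uuid_version(uuid_generate_v1());  -- expect 1
SELECT uuid_version(uuid_generate_v4());  -- expect 4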

The extension update script seems ok, but ISTM that "uuid-ossp--1.1.sql" 
should be replaced with an updated "uuid-ossp--1.2.sql".
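
For reference, bumping the extension version would typically mean something like 
this (following the usual extname--version.sql convention):

uuid-ossp--1.1--1.2.sql    -- the new upgrade script
uuid-ossp--1.2.sql         -- updated base install script
uuid-ossp.control          -- default_version bumped to '1.2'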


--
Fabien.




Re: Allow an alias to be attached directly to a JOIN ... USING

2019-07-12 Thread Thomas Munro
On Tue, Jun 18, 2019 at 2:41 AM Peter Eisentraut
 wrote:
> A small new feature in SQL:2016 allows attaching a table alias to a
> JOIN/USING construct:
>
>  ::=
>   USING   
>   [ AS  ]
>
> (The part in brackets is new.)
>
> This seems quite useful, and it seems the code would already support
> this if we allow the grammar to accept this syntax.

Neat.  That's a refreshingly short patch to get a sql_features.txt
line bumped to YES.

> Patch attached.

It does what it says on the tin.

I see that USING is the important thing here; for (a NATURAL JOIN b)
AS ab or (a JOIN b ON ...) AS ab you still need the parentheses or
(respectively) it means something different (alias for B only) or
doesn't parse.  That makes sense.
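
For reference, a quick sketch of both cases (table and column names made up):

-- new: alias attached directly to JOIN ... USING
SELECT ab.x FROM a JOIN b USING (x) AS ab;

-- still needs the parentheses to alias the join as a whole
SELECT ab.* FROM (a NATURAL JOIN b) AS ab;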

I noticed that the HINT when you accidentally use a base table name
instead of a table alias is more helpful than the HINT you get when
you use a base table name instead of a join alias.  That seems like a
potential improvement that is independent of this syntax change.

-- 
Thomas Munro
https://enterprisedb.com




Re: [PATCH] Implement uuid_version()

2019-07-12 Thread Fabien COELHO



Hello Peter,


A possible option 3 is to keep the function in pgcrypto but change
its C code to call the core code.


Updated patch with this change included.


Patch applies cleanly, compiles (both pg and pgcrypto). make check (global 
and pgcrypto) ok. Doc generation ok. Minor comments:


About doc: I'd consider "generation" instead of "generating" as a 
secondary index term.



(There is also precedent for redirecting the extension function to the
internal one by changing the SQL-level function definition using CREATE
OR REPLACE FUNCTION ... LANGUAGE INTERNAL.  But that seems more
complicated and would require a new extension version.  It could maybe
be included if the extension version is changed for other reasons.)


What about avoiding a redirection with something like:

Datum (* const pg_random_uuid)(PG_FUNCTION_ARGS) = gen_random_uuid;

--
Fabien.




Re: Why is infinite_recurse test suddenly failing?

2019-07-12 Thread Thomas Munro
On Tue, May 21, 2019 at 9:22 AM Mark Wong  wrote:
> Andrew let me know I need to get on v10.  I've upgraded demoiselle, and
> am trying to work through the rest now...

Here's another crash like that.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=cavefish&dt=2019-07-13%2003%3A49%3A38

2019-07-13 04:01:23.437 UTC [9365:70] LOG:  server process (PID 12951)
was terminated by signal 11: Segmentation fault
2019-07-13 04:01:23.437 UTC [9365:71] DETAIL:  Failed process was
running: select infinite_recurse();

-- 
Thomas Munro
https://enterprisedb.com




Re: pg_stat_statements vs. SELECT FOR UPDATE

2019-07-12 Thread Thomas Munro
On Thu, Jul 11, 2019 at 9:08 PM Sergei Kornilov  wrote:
> Patch still applies cleanly on HEAD and passes check-world. I think 
> ignoring the FOR UPDATE clause is incorrect behavior.

With my reviewer hat: I agree.
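
For context, the behaviour in question as I understand it (table name made up):

-- today these two collapse into a single pg_stat_statements entry;
-- with the patch the FOR UPDATE variant is tracked separately
SELECT * FROM accounts WHERE id = 1;
SELECT * FROM accounts WHERE id = 1 FOR UPDATE;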

With my CFM hat: It seems like this patch is ready to land so I have
set it to "Ready for Committer".  But don't worry if you were hoping
to review this and missed out, we have two more pg_stat_statements
patches that need your feedback!

https://commitfest.postgresql.org/23/2080/
https://commitfest.postgresql.org/23/1999/

-- 
Thomas Munro
https://enterprisedb.com




Re: Tab completion for CREATE TYPE

2019-07-12 Thread Thomas Munro
On Tue, May 14, 2019 at 11:13 PM Kyotaro HORIGUCHI
 wrote:
> At Tue, 14 May 2019 18:58:14 +1200, Thomas Munro  
> wrote in 
> > On Tue, May 14, 2019 at 6:18 PM Kyotaro HORIGUCHI
> >  wrote:
> > > I played with this a bit and found that "... (attr=[tab]" (no
> > > space between "r" and "=") completes with '='. Isn't it annoying?
> > >
> > > Only "UPDATE hoge SET a=[tab]" behaves the same way among
> > > existing completions.
> >
> > Hmm.  True.  Here's one way to fix that.
>
> Thanks. That's what was in my mind.

I pushed a fix for that separately.  I remembered that we had decided
to use MatchAnyExcept("...") instead of "!...", so I did it that way.

> Some definition item names are induced from some current states
> (e.g. "CREATE TYPE name AS RANGE (" => "SUBTYPE = ") but I think
> it's too much.
>
> COLLATE is not suggested with possible collations but I think
> suggesting it is not so useful.

Yes, there is room to make it smarter.

> PASSEDBYVALUE is suggested with '=', which is different from
> documented syntax but I don't think that's such a problem for
> those who spell this command out.
>
> # By the way, collatable and preferred are boolean which behaves
> # the same way with passedbyvalue. Is there any intention in the
> # difference in the documentation?

Good question.

> The completion lists contain all possible words correctly (I
> think "analyse" is an implicit synonym.).

I am not a fan of doing anything at all to support alternative
spellings for keywords etc, even though I personally use British
spelling in most contexts outside PostgreSQL source code.  We don't
support MATERIALISED, CATALOGUE, BACKWARDS/FORWARDS (with an S), etc,
so I don't know why we have this one single word ANALYSE from a
different spelling system than the one used by SQL.

> As a result, I find it perfect.

Pushed.  Thanks for the review!

-- 
Thomas Munro
https://enterprisedb.com




Re: Introduce timeout capability for ConditionVariableSleep

2019-07-12 Thread Thomas Munro
On Fri, Jul 12, 2019 at 6:08 PM Shawn Debnath  wrote:
> On Tue, Jul 09, 2019 at 11:03:18PM +1200, Thomas Munro wrote:
> > On Sun, Jul 7, 2019 at 3:09 PM Thomas Munro  wrote:
> > > +   /* Timed out */
> > > +   if (rc == 0)
> > > +   return true;
> >
> > Here's a version without that bit, because I don't think we need it.
>
> This works. Agree that letting it fall through covers the first gap.

Pushed, like that (with the now unused rc variable also removed).
Thanks for the patch!

> > > That still leaves the danger that the CV can be signalled some time
> > > after ConditionVariableTimedSleep() returns.  So now I'm wondering if
> > > ConditionVariableCancelSleep() should signal the CV if it discovers
> > > that this process is not in the proclist, on the basis that that must
> > > indicate that we've been signalled even though we're not interested in
> > > the message anymore, and yet some other process else might be
> > > interested, and that might have been the only signal that is ever
> > > going to be delivered by ConditionVariableSignal().
> >
> > And a separate patch to do that.  Thoughts?
>
> I like it. This covers the gap all the way till cancel is invoked and it
> manipulates the list to remove itself or realizes that it needs to
> forward the signal to some other process.

I pushed this too.  It's a separate commit, because I think there is
at least a theoretical argument that it should be back-patched.  I'm
not going to do that today though, because I doubt anyone is relying
on ConditionVariableSignal() working that reliably yet, and it's
really with timeouts that it becomes a likely problem.
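
For anyone following along, the usage pattern under discussion looks roughly
like this; the CV, the predicate and the wait event name are made up for
illustration, only the ConditionVariable* calls are the real API:

ConditionVariablePrepareToSleep(&shared->job_cv);
while (!job_available())
{
    if (ConditionVariableTimedSleep(&shared->job_cv, idle_timeout_ms,
                                    WAIT_EVENT_MY_JOB_QUEUE))
        break;              /* timed out: the worker may decide to exit */
}
ConditionVariableCancelSleep();     /* with the fix above, an already-consumed
                                     * wakeup gets forwarded to another waiter */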

I thought about this edge case because I have long wanted to propose a
pair of functions that provide a simplified payloadless blocking
alternative to NOTIFY, that would allow for just the right number of
waiting sessions to wake up to handle SKIP LOCKED-style job queues.
Otherwise you sometimes get thundering herds of wakeups fighting over
crumbs.  That made me think about the case where a worker session
decides to time out and shut down due to being idle for too long, but
eats a wakeup on its way out.  Another question that comes up in that
use case is whether CV wakeup queues should be LIFO or FIFO.  I think
the answer is LIFO, to support classic worker pool designs that
stabilise at the right size using a simple idle timeout rule.  They're
currently FIFO (proclist_pop_head_node() to wake up, but
proclist_push_tail() to sleep).  I understand why Robert didn't care
about that last time I mentioned it: all our uses of CVs today are
"broadcast" wakeups.  But a productised version of the "poke" hack I
showed earlier that supports poking just one waiter would care about
the thing this patch fixed, and also the wakeup queue order.

-- 
Thomas Munro
https://enterprisedb.com




Re: Check-out mutable functions in check constraints

2019-07-12 Thread Tomas Vondra

On Fri, Jul 12, 2019 at 07:59:13PM -0400, Tom Lane wrote:

Tomas Vondra  writes:

On Fri, Jul 12, 2019 at 02:00:25PM +0200, Pavel Stehule wrote:

Mutable functions are allowed in check constraint expressions, but
that is not right. The attached is a proposed fix for it including a
regression test.



Yes, and the question is whether this is the right thing to do (I think
it probably is).


I'm pretty sure this change has been proposed before, and rejected before.
Has anybody excavated in the archives for prior discussions?



Yes, I've done some quick searches like "volatile constraint" and so on.
There are a couple of relevant discussions:

2004: 
https://www.postgresql.org/message-id/flat/0C3A1AEC-6BE4-11D8-9224-000A95C88220%40myrealbox.com

2010: 
https://www.postgresql.org/message-id/flat/12849.1277918175%40sss.pgh.pa.us#736c8ef9d7810c0bb85f495490fd40f5

But I don't think the conclusions are particularly clear.

In the first thread you seem to agree with requiring immutable functions
for check constraints (and triggers for one-time checks). The second
thread ended up discussing some new related stuff in SQL standard.

There may be other threads and I just haven't found them, of course.


OTOH, even if we prohibit mutable functions in check constraints, people
can still create triggers doing those checks (and shoot themselves in
the foot that way).


There are, and always will be, lots of ways to shoot yourself in the foot.
In the case at hand, I fear we might just encourage people to mark
functions as immutable when they really aren't --- which will make their
problems *worse* not better, because now other uses besides check
constraints will also be at risk of misoptimization.



True.



--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: warning to publication created and wal_level is not set to logical

2019-07-12 Thread Thomas Munro
On Sat, Jul 13, 2019 at 3:21 AM Lucas Viecelli  wrote:
>> Agreed, fixed.  Also run through pgindent
>
> Thank you for the adjustments.

> All right for me. If wal_level cannot interfere with the test results, it 
> seems to be a better approach

Pushed.  Thanks for the patch!

-- 
Thomas Munro
https://enterprisedb.com




Some thoughts on heaps and freezing

2019-07-12 Thread Thomas Munro
Hello hackers,

Here's a set of ideas that I think could get rid of wraparound freezes
from the traditional heap, using undo logs technology.  They're
inspired by things that Heikki has said over various threads, adapted
to our proposed undo infrastructure.

1.  Don't freeze committed xids by brute force search.  Keep using the
same tuple header as today, but add a pair of 64 bit FullTransactionIds
to the page header (low fxid, high fxid) so that xids are not
ambiguous even after they wrap around.  If you ever find yourself
updating high fxid to a value that is too far ahead of low fxid, you
need to do a micro-freeze of the page, but you were already writing to
the page so that's cool.

2.  Get rid of aborted xids eagerly, instead of relying on brute force
scans to move the horizon.  Remove the xid references at rollback time
with the undo machinery we've built for zheap.  While zheap uses undo
records to rollback the effects of a transaction (reversing in-place
updates etc), these would be very simple undo records that simply
remove item pointers relating to aborted transactions, so their xids
vanish from the heap.  Now the horizon for oldest aborted xid that you
can find anywhere in the system is oldest-xid-having-undo, which is
tracked by the undo machinery.  You don't need to keep more clog than
that AFAIK, other than to support the txid_status() function.

3.  Don't freeze multixacts by brute force search.  Instead, invent 64
bit multixacts and track (low fmxid, high fmxid) and do micro-freezing
on the page when the range would be too wide, as we did in point 1 for
xids.

4.  Get rid of multixacts eagerly.  Move the contents of
pg_multixact/members into undo logs, using the new UNDO_SHARED records
that we invented at PGCon[1] for essentially the same purpose in
zheap.  This is storage that is automatically cleaned up by a "discard
worker" when every member of a set of xids is no longer running (and
it's a bit like the "TED" storage that Heikki talked about a few years
back[2]).  Keep pg_multixact/offsets, but change it to contain undo
record pointers that point to UNDO_SHARED records holding the members.
It is a map of multixact ID -> undo holding the members, and it needs
to exist only to preserve the 32 bit size of multixact IDs; it'd be
nicer to use the undo rec ptr directly, but the goal in this thought
experiment is to make minimal format changes to kill freezing (if you
want more drastic changes, see zheap).  Now you just have to figure
out how to trim pg_multixact/offsets, and I think that could be done
periodically by testing the oldest multixact it holds: has the undo
record it points to been discarded?  If so we can trim this multixact.
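
Pulling points 1 and 3 together, here is a purely illustrative sketch of the
four extra page-header fields - not a proposed on-disk layout, and the 64 bit
multixact type does not exist today:

typedef struct PageFullXidRange
{
    FullTransactionId low_fxid;   /* oldest xid that can appear on the page */
    FullTransactionId high_fxid;  /* newest xid that can appear on the page */
    uint64      low_fmxid;        /* hypothetical 64 bit multixact bounds */
    uint64      high_fmxid;
} PageFullXidRange;

/* On write: if high_fxid would move too far ahead of low_fxid, micro-freeze
 * the page first - we are dirtying it anyway. */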

Finding room for 4 64 bit values on the page header is of course
tricky and incompatible with pg_upgrade, and hard to support
incrementally.  I also don't know exactly at which point you'd
consider high fxid in visibility computations, considering that in
places where you have a tuple pointer, you can't easily find the high
fxid you need.  One cute but scary idea is that when you're scanning
the heap you'd non-durably clobber xmin and xmax with
FrozenTransactionId if appropriate.

[1] 
https://www.postgresql.org/message-id/ca+hukgkni7eeu4ft71vzccwpeagb2pqoekofjqjavknd577...@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/55511D1F.7050902%40iki.fi

-- 
Thomas Munro
https://enterprisedb.com




Re: [sqlsmith] Crash in mcv_get_match_bitmap

2019-07-12 Thread Tomas Vondra

On Thu, Jul 11, 2019 at 05:08:22PM +0200, Tomas Vondra wrote:

On Wed, Jul 10, 2019 at 06:48:16PM -0400, Tom Lane wrote:

Oh ... while we're piling on here, it just sunk into me that
mcv_get_match_bitmap is deciding what the semantics of an operator
are by seeing what it's using for a selectivity estimator.
That is just absolutely, completely wrong.  For starters, it
means that the whole mechanism fails for any operator that wants
to use a specialized estimator --- hardly an unreasonable thing
to do.  For another, it's going to be pretty unreliable for
extensions, because I do not think they're all careful about using
the right estimator --- a lot of 'em probably still haven't adapted
to the introduction of separate <= / >= estimators, for instance.

The right way to determine operator semantics is to look to see
whether they are in a btree opclass.  That's what the rest of the
planner does, and there is no good reason for the mcv code to
do it some other way.



Hmmm, but that will mean we're unable to estimate operators that are not
part of a btree opclass. Which is a bit annoying, because that would also
affect equalities (and thus functional dependencies), in which case the
type may easily have just hash opclass or something.

But maybe I just don't understand how the btree opclass detection works
and it'd be fine for sensibly defined operators. Can you point me to code
doing this elsewhere in the planner?

We may need to do something about the code in pg10, because functional
dependencies do this too (although just with F_EQSEL).  But maybe we should
leave pg10 alone, we got exactly 0 reports about it until now anyway.



Here are WIP patches addressing two of the issues:

1) determining operator semantics by matching them to btree opclasses

2) deconstructing OpExpr into Var/Const nodes

I'd appreciate some feedback particularly on (1).

For the two other issues mentioned in this thread:

a) I don't think unary-argument OpExpr are an issue, because this is
checked when determining which clauses are compatible (and the code only
allows the case with 2 arguments).

b) Const with constisnull=true - I'm not yet sure what to do about this.
The easiest fix would be to deem those clauses incompatible, but that
seems a bit too harsh. The right thing is probably passing the NULL to
the operator proc (but that means we can't use FunctionCall).
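
For (b), a minimal sketch of what bypassing FunctionCall2Coll() could look like
with the v12 fcinfo layout; "opproc", "collation", "values[idx]" and "cst" are
assumed local names:

LOCAL_FCINFO(fcinfo, 2);
Datum       result;
bool        match;

InitFunctionCallInfoData(*fcinfo, &opproc, 2, collation, NULL, NULL);
fcinfo->args[0].value = values[idx];
fcinfo->args[0].isnull = false;
fcinfo->args[1].value = cst->constvalue;
fcinfo->args[1].isnull = cst->constisnull;        /* pass the NULL through */

result = FunctionCallInvoke(fcinfo);
match = !fcinfo->isnull && DatumGetBool(result);  /* NULL result counts as no match */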

Now, looking at this code, I wonder if using DEFAULT_COLLATION_OID when
calling the operator is the right thing. We're using type->typcollation
when building the stats, so maybe we should do the same thing here.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 
>From 043b7525590278fabacd3c10ef869b1ef2b7358f Mon Sep 17 00:00:00 2001
From: Tomas Vondra 
Date: Fri, 12 Jul 2019 22:26:20 +0200
Subject: [PATCH 1/2] Determine operator semantics using btree opclasses

---
 src/backend/statistics/dependencies.c   | 15 +++---
 src/backend/statistics/extended_stats.c | 69 +++--
 src/backend/statistics/mcv.c| 60 -
 src/include/statistics/statistics.h |  1 +
 4 files changed, 99 insertions(+), 46 deletions(-)

diff --git a/src/backend/statistics/dependencies.c b/src/backend/statistics/dependencies.c
index 66c38ce2bc..2723ca33df 100644
--- a/src/backend/statistics/dependencies.c
+++ b/src/backend/statistics/dependencies.c
@@ -14,6 +14,7 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/stratnum.h"
 #include "access/sysattr.h"
 #include "catalog/pg_operator.h"
 #include "catalog/pg_statistic_ext.h"
@@ -788,15 +789,13 @@ dependency_is_compatible_clause(Node *clause, Index relid, AttrNumber *attnum)
     * If it's not an "=" operator, just ignore the clause, as it's not
     * compatible with functional dependencies.
     *
-    * This uses the function for estimating selectivity, not the operator
-    * directly (a bit awkward, but well ...).
-    *
-    * XXX this is pretty dubious; probably it'd be better to check btree
-    * or hash opclass membership, so as not to be fooled by custom
-    * selectivity functions, and to be more consistent with decisions
-    * elsewhere in the planner.
+    * To decide this, we need to decide the semantics of the operator.
+    * We do that by matching it to btree opclasses, and checking the
+    * strategy numbers. When there's no btree opclass, or when there
+    * are multiple strategy numbers, we consider the semantics to be
+    * unclear (i.e. not an equality operator).
     */
-   if (get_oprrest(expr->opno) != F_EQSEL)
+   if (determine_operator_strategy(expr->opno) != BTEqualStrategyNumber)
        return false;

    /*

Re: Check-out mutable functions in check constraints

2019-07-12 Thread Tom Lane
Tomas Vondra  writes:
> On Fri, Jul 12, 2019 at 02:00:25PM +0200, Pavel Stehule wrote:
>>> Mutable functions are allowed in check constraint expressions, but
>>> that is not right. The attached is a proposed fix for it including a
>>> regression test.

> Yes, and the question is whether this is the right thing to do (I think
> it probably is).

I'm pretty sure this change has been proposed before, and rejected before.
Has anybody excavated in the archives for prior discussions?

> OTOH, even if we prohibit mutable functions in check constraints, people
> can still create triggers doing those checks (and shoot themselves in
> the foot that way).

There are, and always will be, lots of ways to shoot yourself in the foot.
In the case at hand, I fear we might just encourage people to mark
functions as immutable when they really aren't --- which will make their
problems *worse* not better, because now other uses besides check
constraints will also be at risk of misoptimization.

regards, tom lane




Re: Check-out mutable functions in check constraints

2019-07-12 Thread Tomas Vondra

On Fri, Jul 12, 2019 at 02:00:25PM +0200, Pavel Stehule wrote:

pá 12. 7. 2019 v 13:11 odesílatel Tomas Vondra 
napsal:


On Fri, Jul 12, 2019 at 08:55:20AM +0200, Pavel Stehule wrote:
>Hi
>
>pá 12. 7. 2019 v 8:45 odesílatel Kyotaro Horiguchi <
horikyota@gmail.com>
>napsal:
>
>> Hello.
>>
>> As mentioned in the following message:
>>
>>
>>
https://www.postgresql.org/message-id/20190712.150527.145133646.horikyota.ntt%40gmail.com
>>
>> Mutable functions are allowed in check constraint expressions, but
>> that is not right. The attached is a proposed fix for it including a
>> regression test.
>>
>> Other "constraints vs " checks do not seem to be exercised
>> but it would be another issue.
>>
>
>I think this feature (although it is correct) can break almost all
>applications - it is 20 years late.
>

I'm not sure it actually breaks such applications.

Let's assume you have a mutable function (i.e. it may change its return value
even with the same parameters) and you use it in a CHECK constraint. Then
I'm pretty sure your application is already broken in various ways and you
just don't know it (sometimes it's subtle, sometimes less so).



Years ago, SQL functions were used for checks instead of triggers - I am not
sure whether this pattern was in the documentation or not, but surely there was
no warning against it.

You can see some documents with examples:

CREATE OR REPLACE FUNCTION check_func(int)
RETURNS boolean AS $$
SELECT EXISTS (SELECT 1 FROM tab WHERE id = $1);
$$ LANGUAGE sql;

CREATE TABLE foo( ... id int CHECK(check_func(id)));



Considering this does not work (e.g. because in READ COMMITTED mode you
won't see the effects of uncommitted DELETE), I'd say this is a quite
nice example of the breakage I mentioned before.
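
To make that concrete, a sketch of the READ COMMITTED race using the
hypothetical tables from the example above:

-- session 1: delete the referenced row, but do not commit yet
BEGIN;
DELETE FROM tab WHERE id = 1;

-- session 2: the CHECK still passes, the uncommitted DELETE is invisible here
INSERT INTO foo (id) VALUES (1);

-- session 1: commit; foo now holds a row whose CHECK would no longer pass
COMMIT;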

You might add locking and make it somewhat safer, but there will still
be plenty of holes (e.g.  because you won't see new but not yet
committed records). But it can cause issues e.g. with pg_dump [1].

So IMHO this is more an argument for adding the proposed check ...





If you have a function that actually is immutable and it's just not marked
accordingly, then that only requires a single DDL to fix that during
upgrade. I don't think that's a massive issue.



These functions are stable, and this patch tries to prohibit them.



Yes, and the question is whether this is the right thing to do (I think
it probably is).

OTOH, even if we prohibit mutable functions in check constraints, people
can still create triggers doing those checks (and shoot themselves in
the foot that way).


[1] https://www.postgresql.org/message-id/6753.1452274727%40sss.pgh.pa.us

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-12 Thread Bruce Momjian
On Fri, Jul 12, 2019 at 12:41:19PM -0600, Ryan Lambert wrote:
> >> I vote for AES 256 rather than 128.
> >
> > Why?  This page seems to think 128 is sufficient:
> >
> >         https://crypto.stackexchange.com/questions/20/
> what-are-the-practical-differences-between-256-bit-192-bit-and-128-bit-aes-enc
> >
> >         For practical purposes, 128-bit keys are sufficient to ensure
> security.
> >         The larger key sizes exist mostly to satisfy some US military
> >         regulations which call for the existence of several distinct
> "security
> >         levels", regardless of whether breaking the lowest level is already
> far
> >         beyond existing technology.
> 
> After researching AES key sizes a bit more my vote is (surprisingly?) for
> AES-128.  My reasoning is about security, I did not consider performance
> impacts in my decision.

Thank you for this exhaustive research.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-12 Thread Ryan Lambert
>> I vote for AES 256 rather than 128.
>
> Why?  This page seems to think 128 is sufficient:
>
>
https://crypto.stackexchange.com/questions/20/what-are-the-practical-differences-between-256-bit-192-bit-and-128-bit-aes-enc
>
> For practical purposes, 128-bit keys are sufficient to ensure
security.
> The larger key sizes exist mostly to satisfy some US military
> regulations which call for the existence of several distinct
"security
> levels", regardless of whether breaking the lowest level is
already far
> beyond existing technology.

After researching AES key sizes a bit more my vote is (surprisingly?) for
AES-128.  My reasoning is about security, I did not consider performance
impacts in my decision.

The purpose of longer keys over shorter keys is about ensuring brute-force
attacks are prohibitively expensive.  Finding a flaw in the algorithm is
the goal of cryptanalysis.  Brute force attacks are only advanced by
increasing computing power.

"The security of a symmetric cryptosystem is a function of two things:  the
strength of the algorithm and the length of the key.  The former is more
important... " [1] (pg 151)

"The algorithm must be so secure that there is no better way to break it
than with a brute-force attack." [1] (pg 152)

Finally, a stated recommendation on key size:  "Insist on at least 112-bit
keys." [1] (pg 153)  Schneier also mentions in that section that breaking
an 80-bit key (brute force) is likely not realistic for another 30 years.
ETA: 2045 based on the publication date.  Brute forcing a 128 bit key is much
further in the future.

Knowing the algorithm is the important part, onto the strength of the
algorithm.  The abstract from [2] states:

"In the case of AES-128, there is no known attack which is faster than the
2^128 complexity of exhaustive search. However, AES-192 and AES-256 were
recently shown to be breakable by attacks which require 2^176 and 2^119
time, respectively."

This shows that both AES-128 (2^128) and AES-192 (2^176) provide more
protection in this case than the AES-256 algorithm provides (2^119).  This
may be surprising because not all AES variants work the same way,
even though it's "all AES."  Again from [2]:

"The key schedules of AES-128 and AES-192 are slightly different, since
they have to apply more mixing operations to the shorter key in order to
produce the slightly smaller number of subkeys for the various rounds. This
small difference in the key schedules plays a major role in making AES-256
more vulnerable to our attacks, in spite of its longer key and supposedly
higher security."

It appears the required key mixing that occurs with shorter key lengths is
actually a strength of the underlying algorithm, and simplifying the key
mixing is bad.  They go on to restate this in a more succinct and damning
way:  "... it clearly indicates that this part of the design of AES-256 is
seriously flawed."

Schneier also mentions how small changes can have big impacts on the
security: "strong cryptosystems, with a couple of minor changes, can become
weak." [1] (pg 152)


[1] Schneier, B. (2015). Applied Cryptography: Protocols, Algorithms and
Source Code in C (20th Anniversary ed.). John Wiley & Sons.

[2] Biryukov, A., Dunkelman, O., Keller, N., Khovratovich, D., & Shamir, A.
(2009). Key Recovery Attacks of Practical Complexity on AES-256 Variants
with up to 10 Rounds. Retrieved from https://eprint.iacr.org/2009/374.pdf


Ryan Lambert
RustProof Labs


Re: [RFC] [PATCH] Flexible "partition pruning" hook

2019-07-12 Thread Mike Palmiotto
On Fri, Jul 12, 2019 at 4:25 AM Amit Langote  wrote:
>
> Sorry for jumping into this thread pretty late.  I have read the
> messages on this thread a couple of times and...
>
> On Fri, Jul 12, 2019 at 3:05 PM Kyotaro Horiguchi
>  wrote:
> > I'm on Peter's side. Unlike other similar hooks, this hook is
> > provided just to introduce arbitrary inconsistency in partition
> > pruning.
>
> ...I have questions about the patch as proposed, such as those Peter
> and Horiguchi-san might have.

Thank you for taking a look at this. I need a bit of time to research
and affirm my previous assumptions, but will address all of these
points as soon as I can.

-- 
Mike Palmiotto
Software Engineer
Crunchy Data Solutions
https://crunchydata.com




Re: [PATCH] kNN for btree

2019-07-12 Thread Nikita Glukhov

Attached is the 13th version of the patches.

On 08.07.2019 21:09, Alexander Korotkov wrote:

I have more thoughts about planner/executor infrastructure.  It
appears that supporting "ORDER BY col1, col2 <-> val" is too complex
for the initial version of the patch.  Handling of the "ORDER BY col" and "ORDER
BY col <-> val" cases uses different code paths in the optimizer.

So, I'd like to limit the first version of this patch to supporting just the
most simple "ORDER BY col <-> val" case.  We would probably be able to
enhance it even in the v13 release cycle, but let's start with just the simple
case.  I also think we can avoid replacing the amcanorderbyop flag with a
method, and instead introduce just a new boolean flag indicating that kNN
supports only the first column.


Now patches 1-8 implement only the "ORDER BY col <-> const" case.
ammatchorderby() is replaced with the amorderbyopfirstcol flag.

Patches 9-12 contain ammatchorderby() and other features, so they may
not be reviewed right now.
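
For reference, the kind of query the restricted first version targets (table,
column and index made up):

CREATE INDEX ON events (event_time);

SELECT * FROM events
 ORDER BY event_time <-> timestamp '2019-07-12 00:00:00'
 LIMIT 10;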


On 08.07.2019 11:07, Thomas Munro wrote:

make check-world fails for me, and in tmp_install/log/install.log I see:
btree_int2.c:97:9: error: implicit declaration of function 'int2dist'
is invalid in C99 [-Werror,-Wimplicit-function-declaration]
 return int2dist(fcinfo);
^
btree_int2.c:97:9: note: did you mean 'int2_dist'?
btree_int2.c:95:1: note: 'int2_dist' declared here
int2_dist(PG_FUNCTION_ARGS)
^
1 error generated.

Fixed.

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-Fix-get_index_column_opclass-v13.patch.gz
Description: application/gzip


0002-Introduce-ammorderbyopfirstcol-v13.patch.gz
Description: application/gzip


0003-Enable-ORDER-BY-operator-scans-on-ordered-indexes-v13.patch.gz
Description: application/gzip


0004-Extract-structure-BTScanState-v13.patch.gz
Description: application/gzip


0005-Add-kNN-support-to-btree-v13.patch.gz
Description: application/gzip


0006-Add-btree-distance-operators-v13.patch.gz
Description: application/gzip


0007-Remove-distance-operators-from-btree_gist-v13.patch.gz
Description: application/gzip


0008-Add-regression-tests-for-kNN-btree-v13.patch.gz
Description: application/gzip


0009-Introduce-ammatchorderby-function-v13.patch.gz
Description: application/gzip


0010-Add-btree-support-of-kNN-on-non-first-column-v13.patch.gz
Description: application/gzip


0011-Allow-ammatchorderby-to-return-pathkey-sublists-v13.patch.gz
Description: application/gzip


0012-Add-support-of-array-ops-to-btree-kNN-v13.patch.gz
Description: application/gzip


Re: idea: log_statement_sample_rate - bottom limit for sampling

2019-07-12 Thread Pavel Stehule
Hi

pá 12. 7. 2019 v 13:07 odesílatel Adrien Nayrat 
napsal:

> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:tested, passed
>
> Sorry, I forgot to tick "passed" boxes.


Thank you

Pavel


Re: proposal - patch: psql - sort_by_size

2019-07-12 Thread Pavel Stehule
pá 12. 7. 2019 v 15:10 odesílatel Fabien COELHO 
napsal:

>
> Hello Pavel,
>
> > rebased patch attached
>
> I prefer patches with a number rather than a date, if possible. For one
> thing, there may be several updates in one day.
>
> About this version (20180708, probably v3): applies cleanly, compiles,
> make check ok, doc build ok. No tests.
>
> It works for me on a few manual tests against a 11.4 server.
>
> Documentation: if you say "\d*+", then it already applies to \db+ and
> \dP+, so why list them? Otherwise, state all commands or make it work
> on all commands that have a size?
>

\dT+ shows sizes too, and there is a mix of values "1, 4, 8, 12, 24, var". I
don't think sorting by size makes sense there.


> About the text:
>- remove , before "sorts"
>- ... outputs by decreasing size, when size is displayed.
>- add: When size is not displayed, the output is sorted by names.
>

ok

>
> I still think that the object name should be kept as a secondary sort
> criterion, in case of size equality, so that the output is deterministic.
> Having plenty of objects of the same size out of alphabetical order looks
> very strange.
>
> I still do not like much the boolean approach. I understand that the name
> approach has been rejected, and I can understand why.
>
> I've been thinking about another more generic interface, that I'm putting
> here for discussion, I do not claim that it is a good idea. Probably could
> fall under "over engineering", but it might not be much harder to
> implement, and it solves a few potential problems.
>
> The idea is to add an option to \d commands, such as "\echo -n":
>
>\dt+ [-o 1d,2a] ...
>
> meaning do the \dt+, order by column 1 descending, column 2 ascending.
> With this there would be no need for a special variable nor other
> extensions to specify some ordering, whatever the user wishes.
>
> Maybe it could be "\dt+ [-o '1 DESC, 2 ASC'] ..." so that the string
> is roughly used as an ORDER BY specification by the query, but it would be
> longer to specify.
>

I have a few objections - although I think this functionality can coexist
with the functionality implemented by this patch:

1. You cannot use the column number to sort by size, because that value is
prettified (using pg_size_pretty).

2. Because of point 1, there is no simple solution for sorting by size.

3. This extension should be generic, and then it will be a much bigger patch.


> It also solves the issue that if someone wants another sorting order we
> would end with competing boolean variables such as SORT_BY_SIZE,
> SORT_BY_TYPE, SORT_BY_SCHEMA, which would be pretty unpractical. The
> boolean approach works for *one* sorting extension and breaks at the next
> extension.
>
> Also, the boolean does not say that it is a descending order. I could be
> interested in looking at the small tables.
>
> Another benefit for me is that I do not like much variables with side
> effects, whereas with an explicit syntax there would be no such thing, the
> user has what was asked for. Ok, psql is full of them, but I cannot say I
> like it for that.
>
> The approach could be extended to specify a limit, eg \dt -l 10 would
> add a LIMIT 10 on the query.
>

It is a common problem - when you do some repeated task, you want to do it
quickly.  But sometimes you want to do some specialized task, and then you
should be able to override the default setting easily.

A good system should support both. But commands that allow
parametrization can be hard to learn and hard to use. There are lots of
users of "vim" or "emacs", but most users prefer "notepad".

It is all about finding a compromise.



> Also, the implementation could be high enough so that the description
> handlers would not have to deal with it individually, it could return
> the query which would then be completed with SORT/LIMIT clauses before
> being executed, possibly with a default order if none is specified.
>

I don't think your proposal is bad, and it is not in conflict with this
patch, but it

a) doesn't solve the SORT BY SIZE problem
b) requires modifying the parser of every related \command - so it will be
a bigger and more massive patch.

At this moment I still prefer my simple implementation. My patch is related
just to a few describe commands. Your proposal would have to be really generic
(there is no reason to limit it just to reports with size).

The simple boolean design doesn't block any future enhancement. The effect of
the SORT_BY_SIZE variable can be overridden by some specialized future option
used inside a \command.
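
For illustration, intended usage is simply (variable name as used in this
thread):

\set SORT_BY_SIZE on
\dt+        -- now listed by decreasing size
\unset SORT_BY_SIZE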

Regards

Pavel


>
> --
> Fabien.
>
>
>


Re: warning to publication created and wal_level is not set to logical

2019-07-12 Thread Tom Lane
Thomas Munro  writes:
> On Wed, Jul 10, 2019 at 12:47 PM Tom Lane  wrote:
>> 3. AFAICS, the proposed test case changes will cause the core regression
>> tests to fail if wal_level is not replica.  This is not true today ---
>> they pass regardless of wal_level --- and I object in the strongest terms
>> to making it otherwise.

> ... how about we just suppress WARNINGs for
> CREATE PUBLICATION commands that are expected to succeed?  Like in the
> attached.  This version passes installcheck with any wal_level.

WFM.

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-12 Thread Bruce Momjian
On Fri, Jul 12, 2019 at 02:15:02PM +0900, Masahiko Sawada wrote:
> > We will use CBC AES128 mode for tables/indexes, and CTR AES128 for WAL.
> > 8k pages will use the LSN as a nonce, which will be encrypted to
> > generate the initialization vector (IV).  We will not encrypt the first
> > 16 bytes of each pages so the LSN can be used in this way.  The WAL will
> > use the WAL file segment number as the nonce and the IV will be created
> > in the same way.
> >
> > wal_log_hints will be enabled automatically in encryption mode, like we
> > do for checksum mode, so we never encrypt different 8k pages with the
> > same IV.
> 
> I guess that two different pages can have the same LSN when a heap
> update modifies both the page for the old tuple and another page for the
> new tuple.
> 
> heapam.c:3707
> recptr = log_heap_update(relation, buffer,
>  newbuf, &oldtup, heaptup,
>  old_key_tuple,
>  all_visible_cleared,
>  all_visible_cleared_new);
> if (newbuf != buffer)
> {
> PageSetLSN(BufferGetPage(newbuf), recptr);
> }
> PageSetLSN(BufferGetPage(buffer), recptr);
> 
> Wouldn't it be a problem?

I had the same question.  If someone does:

UPDATE tab SET col = col + 1

then each row change gets its own LSN.  You are asking if an update that
just expires one row and adds it to a new page gets the same LSN.  I
don't know.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Brazil disables DST - 2019b update

2019-07-12 Thread Tom Lane
Tomas Vondra  writes:
> On Fri, Jul 12, 2019 at 02:52:53PM +1000, Robins Tharakan wrote:
>> An August release does give a little more comfort. (I was expecting that
>> the August
>> date would get pushed out since 11.4 was an emergency release at the end of
>> June).

> I think the plan is still to do the August release.

Yes, the August releases will happen on the usual schedule.

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-12 Thread Bruce Momjian
On Fri, Jul 12, 2019 at 07:26:21AM -0400, Sehrope Sarkuni wrote:
> On Thu, Jul 11, 2019 at 9:05 PM Bruce Momjian  wrote:
> >
> > On Thu, Jul 11, 2019 at 08:41:52PM -0400, Joe Conway wrote:
> > > I vote for AES 256 rather than 128.
> >
> > Why?  This page seems to think 128 is sufficient:
> >
> > 
> > https://crypto.stackexchange.com/questions/20/what-are-the-practical-differences-between-256-bit-192-bit-and-128-bit-aes-enc
> >
> > For practical purposes, 128-bit keys are sufficient to ensure 
> > security.
> > The larger key sizes exist mostly to satisfy some US military
> > regulations which call for the existence of several distinct 
> > "security
> > levels", regardless of whether breaking the lowest level is already 
> > far
> > beyond existing technology.
> >
> > We might need to run some benchmarks to determine the overhead of going
> > to AES256, because I am unclear of the security value.
> 
> If the algorithm and key size is not going to be configurable then
> better to lean toward the larger size, especially given the desire for
> future proofing against standards evolution and potential for the
> encrypted data to be very long lived. NIST recommends AES-128 or
> higher but there are other publications that recommend AES-256 for
> long term usage:
> 
> NIST - 2019 : Recommends AES-128, AES-192, or AES-256.
> https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-131Ar2.pdf
> 
> NSA - 2016 : Recommends AES-256 for future quantum resistance.
> https://apps.nsa.gov/iaarchive/library/ia-guidance/ia-solutions-for-classified/algorithm-guidance/cnsa-suite-and-quantum-computing-faq.cfm
> 
> ECRYPT - 2015 - Recommends AES-256 for future quantum resistance.
> https://www.ecrypt.eu.org/csa/documents/PQC-whitepaper.pdf
> 
> ECRYPT - 2018 - Recommends AES-256 for long term use.
> https://www.ecrypt.eu.org/csa/documents/D5.4-FinalAlgKeySizeProt.pdf

Oh, interesting.  Let's see what performance tests with the database
show.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-12 Thread Bruce Momjian
On Fri, Jul 12, 2019 at 03:20:37PM +0900, Masahiko Sawada wrote:
> Thank you for summarizing the discussion, it's really helpful. I'll
> update the wiki page based on the summary.
> 
> On Fri, Jul 12, 2019 at 10:05 AM Bruce Momjian  wrote:
> > > The keys themselves should be in an file which is encrypted by a master
> > > key. Obtaining the master key should be pattern it after the GUC
> > > ssl_passphrase_command.
> 
> +1.
> I will update the patch set based on the decision on this thread.

Thanks.

> > > > We will use CBC AES128 mode for tables/indexes, and CTR AES128 for WAL.
> > > > 8k pages will use the LSN as a nonce, which will be encrypted to
> > > > generate the initialization vector (IV).  We will not encrypt the first
> > > > 16 bytes of each pages so the LSN can be used in this way.  The WAL will
> > > > use the WAL file segment number as the nonce and the IV will be created
> > > > in the same way.
> > >
> > > I vote for AES 256 rather than 128.
> >
> > Why?  This page seems to think 128 is sufficient:
> >
> > 
> > https://crypto.stackexchange.com/questions/20/what-are-the-practical-differences-between-256-bit-192-bit-and-128-bit-aes-enc
> >
> > For practical purposes, 128-bit keys are sufficient to ensure 
> > security.
> > The larger key sizes exist mostly to satisfy some US military
> > regulations which call for the existence of several distinct 
> > "security
> > levels", regardless of whether breaking the lowest level is already 
> > far
> > beyond existing technology.
> >
> > We might need to run some benchmarks to determine the overhead of going
> > to AES256, because I am unclear of the security value.
> 
> 'openssl speed' will help to see the performance differences easily.
> FWIW I got the following result in my environment (Intel(R) Core(TM)
> i7-3770 CPU @ 3.40GHz).
> 
> $ openssl speed -evp aes-128-cbc
> type            16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> aes-128-cbc 642449.60k   656404.63k   700231.23k   706461.71k   706051.44k
> 
> $ openssl speed -evp aes-256-cbc
> type            16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> aes-256-cbc 466787.73k   496237.08k   503477.16k   507113.32k   508453.80k

I saw similar numbers on my Intel(R) Xeon(R) CPU E5620  @ 2.40GHz with
AES optimization enabled on the CPUs:

$ grep -i '\' /proc/cpuinfo | wc -l
16

Doing aes-128-cbc for 3s on 8192 size blocks: 254000 aes-128-cbc's in 
3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 182496 aes-256-cbc's in 
3.00s

which shows AES256 as 40% slower than AES128, which matches the 40%
mentioned here:


https://crypto.stackexchange.com/questions/20/what-are-the-practical-differences-between-256-bit-192-bit-and-128-bit-aes-enc

The larger key sizes imply some CPU overhead (+20% for a 192-bit key,
+40% for a 256-bit key: internally, the AES is a sequence of "rounds"
and the AES standard says that there shall be 10, 12 or 14 rounds, for a
128-bit, 192-bit or 256-bit key, respectively). So there is some
rational reason not to use a larger than necessary key.

> Regarding the security value, I found an interesting post by Bruce Schneier.
> 
> https://www.schneier.com/blog/archives/2009/07/another_new_aes.html
> 
> "And for new applications I suggest that people don't use AES-256.
> AES-128 provides more than enough security margin for the forseeable
> future. But if you're already using AES-256, there's no reason to
> change."

Yes, that is what I have heard too.  I think the additional number of
people who use encryption because of its lower overhead will greatly
outweigh the benefit of using AES256 vs AES128.

> > Yes, sorry, master key rotation is simple.  It is encryption key
> > rotation that I think needs a tool.
> 
> Agreed.
> 
> To rotate the master key we can have a SQL function or dedicated SQL
> command passing the new master key or the passphrase to postgres.

Well, depending on how we store the encryption key, we will probably
change the master key via a command-line tool like pg_checksums.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: warning to publication created and wal_level is not set to logical

2019-07-12 Thread Lucas Viecelli
> Agreed, fixed.  Also run through pgindent
>

Thank you for the adjustments.


> I agree that it's not really worth having tests for this, and I take
> your point about the dependency on wal_level that we don't currently
> have.  The problem is that the core tests include publications
> already, and it doesn't seem like a great idea to move the whole lot
> to a TAP test.  Creating alternative expected files seems like a bad
> idea too (annoying to maintain, wouldn't compose well with the next
> thing like this).  So... how about we just suppress WARNINGs for
> CREATE PUBLICATION commands that are expected to succeed?  Like in the
> attached.  This version passes installcheck with any wal_level.
>
All right for me. If wal_level cannot interfere with the test results,
it seems to be a better approach

*Lucas Viecelli*





Re: Brazil disables DST - 2019b update

2019-07-12 Thread Tomas Vondra

On Fri, Jul 12, 2019 at 02:52:53PM +1000, Robins Tharakan wrote:

On Fri, 12 Jul 2019 at 14:04, Michael Paquier  wrote:


On Fri, Jul 12, 2019 at 01:42:59PM +1000, Robins Tharakan wrote:
So 2019b has been released on the 1st of July.  Usually tzdata updates
happen just before a minor release, so this would get pulled in at the
beginning of August (https://www.postgresql.org/developer/roadmap/).
Tom, I guess that would be again the intention here?
--
Michael



An August release does give a little more comfort. (I was expecting that
the August
date would get pushed out since 11.4 was an emergency release at the end of
June).



I think the plan is still to do the August release. One of the fixes in
the out-of-cycle release actually introduced a new regression, but we've
decided not to do another one exactly because there's a next minor release
scheduled in ~three weeks anyway.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Brazil disables DST - 2019b update

2019-07-12 Thread Tom Lane
Magnus Hagander  writes:
> On Fri, Jul 12, 2019 at 4:14 PM Tom Lane  wrote:
>> The usual recommendation on well-maintained production systems is to
>> configure PG with --with-system-tzdata, then rely on your platform
>> vendor for timely updates of that data.

> It should be noted that this is not true on Windows -- on Windows we cannot
> use the system timezone functionality, and rely entirely on the files we
> ship as part of our release.

IMO this is one of many reasons why Windows isn't a great choice of
platform for production use of Postgres ;-).

I hear that Microsoft is going to start embedding some flavor of
Linux in Windows, which presumably would extend to having a copy of
/usr/share/zoneinfo somewhere.  It'll be interesting to see how that
works and whether they'll maintain it well enough that it'd be a
plausible tzdata reference.

regards, tom lane




Re: Brazil disables DST - 2019b update

2019-07-12 Thread Magnus Hagander
On Fri, Jul 12, 2019 at 4:14 PM Tom Lane  wrote:

> Robins Tharakan  writes:
> > The 2019b DST update [1] disables DST for Brazil. This would take effect
> > starting November 2019. The last DST update in Postgres was 2019a in
> v11.3
> > (since this update came in after the recent-most Postgres release).
>
> Yeah.  I intend to install 2019b (or later?) before our next minor
> releases.
>
> > Since a ~3 month release cycle may be too close for some users, are there
> > any plans for an early 11.5 (or are such occurrences not a candidate for
> an
> > early release)?
>
> We do not consider tzdb updates to be a release-forcing issue.
> The fact that we ship tzdb at all is just a courtesy to PG users who
> are on platforms that lack a more direct way to get tzdb updates.
> The usual recommendation on well-maintained production systems is to
> configure PG with --with-system-tzdata, then rely on your platform
> vendor for timely updates of that data.
>

It should be noted that this is not true on Windows -- on Windows we cannot
use the system timezone functionality, and rely entirely on the files we
ship as part of our release.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [PATCH] get rid of StdRdOptions, use individual binary reloptions representation for each relation kind instead

2019-07-12 Thread Nikolay Shaplov
In a message from Thursday, 4 July 2019 at 19:44:42 MSK, Dent John 
wrote:
> Hi Nikolay,
> 
> I have had a crack at re-basing the current patch against 12b2, but I didn’t
> trivially succeed.
> 
> It’s probably more my bodged fixing of the rejections than a fundamental
> problem. But I now get compile fails in — and seems like only — vacuum.c.

I am really sorry for such a big delay. Two new relation options were added; they 
needed careful checking while importing them back into the patch.
I did not find a proper timeslot to do this quickly enough.
I hope I am not terribly late.
Here goes an updated version.



diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 5773021..5483378 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -22,7 +22,7 @@
 #include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "access/reloptions.h"
-#include "access/spgist.h"
+#include "access/spgist_private.h"
 #include "access/tuptoaster.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
@@ -46,9 +46,8 @@
  * upper and lower bounds (if applicable); for strings, consider a validation
  * routine.
  * (ii) add a record below (or use add__reloption).
- * (iii) add it to the appropriate options struct (perhaps StdRdOptions)
- * (iv) add it to the appropriate handling routine (perhaps
- * default_reloptions)
+ * (iii) add it to the appropriate options struct
+ * (iv) add it to the appropriate handling routine
  * (v) make sure the lock level is set correctly for that operation
  * (vi) don't forget to document the option
  *
@@ -1030,7 +1029,7 @@ extractRelOptions(HeapTuple tuple, TupleDesc tupdesc,
 		case RELKIND_TOASTVALUE:
 		case RELKIND_MATVIEW:
 		case RELKIND_PARTITIONED_TABLE:
-			options = heap_reloptions(classForm->relkind, datum, false);
+			options = relation_reloptions(classForm->relkind, datum, false);
 			break;
 		case RELKIND_VIEW:
 			options = view_reloptions(datum, false);
@@ -1363,67 +1362,73 @@ fillRelOptions(void *rdopts, Size basesize,
 
 
 /*
- * Option parser for anything that uses StdRdOptions.
+ * Option parsing definition for autovacuum. Used in toast and heap options.
+ */
+
+#define AUTOVACUUM_RELOPTIONS(OFFSET)\
+		{"autovacuum_enabled", RELOPT_TYPE_BOOL, \
+		OFFSET + offsetof(AutoVacOpts, enabled)},\
+		{"autovacuum_vacuum_threshold", RELOPT_TYPE_INT, \
+		OFFSET + offsetof(AutoVacOpts, vacuum_threshold)},   \
+		{"autovacuum_analyze_threshold", RELOPT_TYPE_INT,\
+		OFFSET + offsetof(AutoVacOpts, analyze_threshold)},  \
+		{"autovacuum_vacuum_cost_delay", RELOPT_TYPE_REAL,   \
+		OFFSET + offsetof(AutoVacOpts, vacuum_cost_delay)},  \
+		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,\
+		OFFSET + offsetof(AutoVacOpts, vacuum_cost_limit)},  \
+		{"autovacuum_freeze_min_age", RELOPT_TYPE_INT,   \
+		OFFSET + offsetof(AutoVacOpts, freeze_min_age)}, \
+		{"autovacuum_freeze_max_age", RELOPT_TYPE_INT,   \
+		OFFSET + offsetof(AutoVacOpts, freeze_max_age)}, \
+		{"autovacuum_freeze_table_age", RELOPT_TYPE_INT, \
+		OFFSET + offsetof(AutoVacOpts, freeze_table_age)},   \
+		{"autovacuum_multixact_freeze_min_age", RELOPT_TYPE_INT, \
+		OFFSET + offsetof(AutoVacOpts, multixact_freeze_min_age)},   \
+		{"autovacuum_multixact_freeze_max_age", RELOPT_TYPE_INT, \
+		OFFSET + offsetof(AutoVacOpts, multixact_freeze_max_age)},   \
+		{"autovacuum_multixact_freeze_table_age", RELOPT_TYPE_INT,   \
+		OFFSET + offsetof(AutoVacOpts, multixact_freeze_table_age)}, \
+		{"log_autovacuum_min_duration", RELOPT_TYPE_INT, \
+		OFFSET + offsetof(AutoVacOpts, log_min_duration)},   \
+		{"autovacuum_vacuum_scale_factor", RELOPT_TYPE_REAL, \
+		OFFSET + offsetof(AutoVacOpts, vacuum_scale_factor)},\
+		{"autovacuum_analyze_scale_factor", RELOPT_TYPE_REAL,\
+		OFFSET + offsetof(AutoVacOpts, analyze_scale_factor)}
+
+/*
+ * Option parser for heap
  */
 bytea *
-default_reloptions(Datum reloptions, bool validate, relopt_kind kind)
+heap_reloptions(Datum reloptions, bool validate)
 {
 	relopt_value *options;
-	StdRdOptions *rdopts;
+	HeapRelOptions *rdopts;
 	int			numoptions;
 	static const relopt_parse_elt tab[] = {
-		{"fillfactor", RELOPT_TYPE_INT, offsetof(StdRdOptions, fillfactor)},
-		{"autovacuum_enabled", RELOPT_TYPE_BOOL,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, enabled)},
-		{"autovacuum_vacuum_threshold", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, vacuum_threshold)},
-		{"autovacuum_analyze_threshold", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovacuum) + offsetof(AutoVacOpts, analyze_threshold)},
-		{"autovacuum_vacuum_cost_limit", RELOPT_TYPE_INT,
-		offsetof(StdRdOptions, autovac

Re: Brazil disables DST - 2019b update

2019-07-12 Thread Tom Lane
Robins Tharakan  writes:
> The 2019b DST update [1] disables DST for Brazil. This would take effect
> starting November 2019. The last DST update in Postgres was 2019a in v11.3
> (since this update came in after the recent-most Postgres release).

Yeah.  I intend to install 2019b (or later?) before our next minor
releases.

> Since a ~3 month release cycle may be too close for some users, are there
> any plans for an early 11.5 (or are such occurrences not a candidate for an
> early release)?

We do not consider tzdb updates to be a release-forcing issue.
The fact that we ship tzdb at all is just a courtesy to PG users who
are on platforms that lack a more direct way to get tzdb updates.
The usual recommendation on well-maintained production systems is to
configure PG with --with-system-tzdata, then rely on your platform
vendor for timely updates of that data.
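
For example, on a typical Linux system that is something like the following
(the zoneinfo path varies by platform):

./configure --with-system-tzdata=/usr/share/zoneinfo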

regards, tom lane




Re: Option to dump foreign data in pg_dump

2019-07-12 Thread Luis Carril

> > On 28 Jun 2019, at 19:55, Luis Carril  wrote:
> > What about providing a list of FDW servers instead of an all or nothing 
> > option? In that way the user really has to do a conscious decision to dump 
> > the content of the foreign tables for > > a specific server, this would 
> > allow distinction if multiple FDW are being used in the same DB.

> I think this is a good option, the normal exclusion rules can then still apply
> in case not everything from a specific server is of interest.

Hi, here is a new patch to dump the data of foreign tables using pg_dump.
This time the user specifies the foreign servers whose data will be dumped,
which helps when a database mixes writable and non-writable FDWs.
It would be nice to emit an error if the FDW is read-only, but that
information is not available in the catalog.
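For instance, the catalogs only expose the server-to-wrapper mapping, with
nothing that marks a wrapper as read-only; roughly:

SELECT s.srvname AS server, w.fdwname AS wrapper
FROM pg_catalog.pg_foreign_server s
JOIN pg_catalog.pg_foreign_data_wrapper w ON w.oid = s.srvfdw;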

Cheers
Luis M Carril
From d24cbf4ad0852b079d0e16103486873ab6bb8b69 Mon Sep 17 00:00:00 2001
From: Luis Carril 
Date: Fri, 28 Jun 2019 16:05:43 +0200
Subject: [PATCH] Support foreign data in pg_dump

---
 src/bin/pg_dump/pg_dump.c | 107 +++---
 src/bin/pg_dump/pg_dump.h |   1 +
 2 files changed, 102 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 806fc78f04..ceff6a1744 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -123,6 +123,8 @@ static SimpleStringList table_exclude_patterns = {NULL, NULL};
 static SimpleOidList table_exclude_oids = {NULL, NULL};
 static SimpleStringList tabledata_exclude_patterns = {NULL, NULL};
 static SimpleOidList tabledata_exclude_oids = {NULL, NULL};
+static SimpleStringList foreign_servers_include_patterns = {NULL, NULL};
+static SimpleOidList foreign_servers_include_oids = {NULL, NULL};
 
 
 char		g_opaque_type[10];	/* name for the opaque type */
@@ -159,6 +161,10 @@ static void expand_schema_name_patterns(Archive *fout,
 		SimpleStringList *patterns,
 		SimpleOidList *oids,
 		bool strict_names);
+static void expand_foreign_server_name_patterns(Archive *fout,
+		SimpleStringList *patterns,
+		SimpleOidList *oids,
+		bool strict_names);
 static void expand_table_name_patterns(Archive *fout,
 	   SimpleStringList *patterns,
 	   SimpleOidList *oids,
@@ -391,6 +397,7 @@ main(int argc, char **argv)
 		{"no-sync", no_argument, NULL, 7},
 		{"on-conflict-do-nothing", no_argument, &dopt.do_nothing, 1},
 		{"rows-per-insert", required_argument, NULL, 10},
+		{"include-foreign-data", required_argument, NULL, 11},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -607,6 +614,10 @@ main(int argc, char **argv)
 dopt.dump_inserts = (int) rowsPerInsert;
 break;
 
+			case 11:/* include foreign data */
+simple_string_list_append(&foreign_servers_include_patterns, optarg);
+break;
+
 			default:
 fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
 exit_nicely(1);
@@ -815,6 +826,15 @@ main(int argc, char **argv)
 			   &tabledata_exclude_oids,
 			   false);
 
+	if (foreign_servers_include_patterns.head != NULL)
+	{
+		expand_foreign_server_name_patterns(fout, &foreign_servers_include_patterns,
+			&foreign_servers_include_oids,
+			strict_names);
+		if (foreign_servers_include_oids.head == NULL)
+			fatal("no matching foreign servers were found");
+	}
+
 	/* non-matching exclusion patterns aren't an error */
 
 	/*
@@ -1038,6 +1058,9 @@ help(const char *progname)
 	printf(_("  --use-set-session-authorization\n"
 			 "   use SET SESSION AUTHORIZATION commands instead of\n"
 			 "   ALTER OWNER commands to set ownership\n"));
+	printf(_("  --include-foreign-data=SERVER\n"
+			 "   include data of foreign tables with the named\n"
+			 "   foreign servers in dump\n"));
 
 	printf(_("\nConnection options:\n"));
 	printf(_("  -d, --dbname=DBNAME  database to dump\n"));
@@ -1336,6 +1359,54 @@ expand_schema_name_patterns(Archive *fout,
 	destroyPQExpBuffer(query);
 }
 
+/*
+ * Find the OIDs of all foreign servers matching the given list of patterns,
+ * and append them to the given OID list.
+ */
+static void
+expand_foreign_server_name_patterns(Archive *fout,
+			SimpleStringList *patterns,
+			SimpleOidList *oids,
+			bool strict_names)
+{
+	PQExpBuffer query;
+	PGresult   *res;
+	SimpleStringListCell *cell;
+	int			i;
+
+	if (patterns->head == NULL)
+		return;	/* nothing to do */
+
+	query = createPQExpBuffer();
+
+	/*
+	 * The loop below runs multiple SELECTs might sometimes result in
+	 * duplicate entries in the OID list, but we don't care.
+	 */
+
+	for (cell = patterns->head; cell; cell = cell->next)
+	{
+		appendPQExpBuffer(query,
+		  "SELECT oid FROM pg_catalog.pg_foreign_server s\n");
+		processSQLNamePattern(GetConnection(fout), query, cell->val, false,
+			  false, NULL, "s.

Re: Patch to document base64 encoding

2019-07-12 Thread Fabien COELHO



Hello Karl,


doc_base64_part1_v9.patch
 Only moves, eliminates duplicates, and adjusts indentation.



doc_base64_part2_v9.patch
 Cleanup wording after moving functions between sections.

doc_base64_part3_v9.patch
 Documents base64, hex, and escape encode() and decode()
 formats.



I suggested "Function <>decode ...", which is the kind of thing we
do in academic writing to improve precision, because I thought it
could be better:-)


"Function <>decode ..." just does not work in English.


It really works in research papers: "Theorem X can be proven by applying 
Proposition Y. See Figure 2 for details. Algorithm Z describes whatever,

which is listed in Table W..."


I also alphabetized by format name.


Good:-)


I hope that 3 patches will make review easier.


Not really. I'm reviewing the 3 patches put together rather than each one 
individually, which would require more work.


Patch applies cleanly. Doc build ok.

I looked at the html output, and it seems ok, including navigating to 
conversions or formats explanations.


This documentation patch is an overall improvement and clarifies things, 
including some error conditions.


convert: I'd merge the first two sentences to state that it converts from X
to Y. The doc does not say explicitly what happens if a character cannot
be converted. After testing, an error is raised. The example comment could
add ", if possible".


to_hex: add "." at the end of the sentence?

Other descriptions seem ok.

Minor comment: you usually put two spaces between a "." and the first
word of the next sentence, but not always.


--
Fabien.




Re: POC: Cleaning up orphaned files using undo logs

2019-07-12 Thread Robert Haas
On Fri, Jul 12, 2019 at 5:40 AM Amit Kapila  wrote:
> I am not sure but don't we need to retain the color of z as well?

I believe that would be very wrong. If you recolor an internal node,
you'll break the constant-black-height invariant.

> Apart from this, the duplicate key (ex. for size queues the size of
> two requests can be same) handling might need some work.  Basically,
> either special combine function needs to be written (not sure yet what
> we should do there) or we always need to ensure that the key is unique
> like (size + start_urec_ptr).  If the size is the same, then we can
> decide based on start_urec_ptr.

I think that this problem is somewhat independent of whether we use an
rbtree or a binaryheap or some other data structure.  I would be
inclined to use XID as a tiebreak for the size queue, so that it's
more likely to match the ordering of the XID queue, but if that's
inconvenient, then some other arbitrary value like start_urec_ptr
should be fine.

> I think we can go by changing the implementation to rbtree by doing
> some enhancements instead of the binary heap or alternatively, we can
> use one of the two ideas suggested by you in the email above [1] to
> simplify the code and keep using the binary heap for now.  Especially,
> I like the below one.
> "2. However, I don't think we should have a separate request object
> for each queue anyway. We should insert pointers to the same objects
> in all the relevant queue (either size + XID, or else error). So
> instead of having three sets of objects, one for each queue, we'd just
> have one set of objects and point to them with as many as two
> pointers.
> We'd therefore need LESS memory than we're using today, because we
> wouldn't have separate arrays for XID, size, and error queue
> elements."
>
> I think even if we currently go with a binary heap, it will be
> possible to change it to rbtree later, but I am fine either way.

Well, I don't see much point in revising all of this logic twice. We
should pick the way we want it to work and make it work that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: proposal - patch: psql - sort_by_size

2019-07-12 Thread Fabien COELHO



Hello Pavel,


rebased patch attached


I prefer patches with a number rather than a date, if possible. For one 
thing, there may be several updates in one day.


About this version (20180708, probably v3): applies cleanly, compiles, 
make check ok, doc build ok. No tests.


It works for me on a few manual tests against a 11.4 server.

Documentation: if you say "\d*+", then it already applies to \db+ and
\dP+, so why list them? Otherwise, state all the commands, or make it work
on all commands that have a size.


About the text:
  - remove , before "sorts"
  - ... outputs by decreasing size, when size is displayed.
  - add: When size is not displayed, the output is sorted by names.

I still think that the object name should be kept as a secondary sort 
criterion, in case of size equality, so that the output is deterministic. 
Having plenty of objects of the same size out of alphabetical order looks 
very strange.
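
In SQL terms, the ordering I have in mind is roughly (a sketch, plain
tables only):

SELECT n.nspname AS schema, c.relname AS name,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC, c.relname ASC;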


I still do not much like the boolean approach. I understand that the name
approach has been rejected, and I can understand why.


I've been thinking about another, more generic interface that I'm putting
here for discussion; I do not claim that it is a good idea. It could probably
fall under "over-engineering", but it might not be much harder to
implement, and it solves a few potential problems.


The idea is to add an option to \d commands, such as "\echo -n":

  \dt+ [-o 1d,2a] ...

meaning do the \dt+, order by column 1 descending, column 2 ascending.
With this there would be no need for a special variable nor other 
extensions to specify some ordering, whatever the user wishes.


Maybe it could be "\dt+ [-o '1 DESC, 2 ASC'] ..." so that the string
is roughly used as an ORDER BY specification by the query, but it would be 
longer to specify.


It also solves the issue that if someone wants another sorting order we
would end up with competing boolean variables such as SORT_BY_SIZE,
SORT_BY_TYPE, SORT_BY_SCHEMA, which would be pretty impractical. The
boolean approach works for *one* sorting extension and breaks at the next
extension.


Also, the boolean does not say that it is a descending order. I could be 
interested in looking at the small tables.


Another benefit for me is that I do not much like variables with side
effects, whereas with an explicit syntax there is no such thing: the
user gets what was asked for. Ok, psql is full of them, but I cannot say I
like it for that.


The approach could be extended to specify a limit, eg \dt -l 10 would
add a LIMIT 10 on the query.

Also, the implementation could sit at a high enough level that the description
handlers would not have to deal with it individually: it could return
the query, which would then be completed with SORT/LIMIT clauses before
being executed, possibly with a default order if none is specified.


--
Fabien.




Re: Allow table AM's to cache stuff in relcache

2019-07-12 Thread Julien Rouhaud
On Fri, Jun 14, 2019 at 5:40 PM Tom Lane  wrote:
>
> Heikki Linnakangas  writes:
> > Index AM's can cache stuff in RelationData->rd_amcache. In the zedstore
> > table AM we've been hacking on, I'd like to also use rd_amcache to cache
> > some information, but that's not currently possible, because rd_amcache
> > can only be used by index AMs, not table AMs.
> > Attached patch allows rd_amcache to also be used by table AMs.
>
> Seems reasonable.

+1.

> > In the patch, I documented that rd_amcache must be allocated in
> > CacheMemoryContext, or in rd_indexcxt if it's an index. It works, but
> > it's a bit weird.
>
> Given the way the patch is implemented, it doesn't really matter which
> context it's in, does it?  The retail pfree is inessential but also
> harmless, if rd_amcache is in rd_indexcxt.  So we could take out the
> "must".  I think it's slightly preferable to use rd_indexcxt if available,
> to reduce the amount of loose junk in CacheMemoryContext.

I agree that for indexes the context used won't make much difference.
But IMHO avoiding some bloat in CacheMemoryContext is a good enough
reason to document using rd_indexcxt when available.

> > It would nice to have one memory context in every
> > relcache entry, to hold all the stuff related to it, including
> > rd_amcache. In other words, it would be nice if we had "rd_indexcxt" for
> > tables, too, not just indexes. That would allow tracking memory usage
> > more accurately, if you're debugging an out of memory situation for example.
>
> We had some discussion related to that in the "hyrax
> vs. RelationBuildPartitionDesc" thread.  I'm not quite sure where
> we'll settle on that, but some redesign seems inevitable.

There hasn't been any progress on this since last month, and this patch
won't make the situation any worse.  I'll mark this patch as ready for
committer, as it may save some time for people working on custom table
AMs.




Re: pg_stat_database update stats_reset only by pg_stat_reset

2019-07-12 Thread Tomas Vondra

On Fri, Jul 12, 2019 at 01:51:50PM +0900, Michael Paquier wrote:

On Thu, Jul 11, 2019 at 04:34:20PM +0200, Daniel Verite wrote:

I can understand why you'd want resetting the stats for a single object
not to reset the per-database timestamp, but this would revert an 8+ year
old decision that seems intentional and has apparently not been criticized
since then (based on searching for pg_stat_reset_single_table_counters in
the archives). More opinions are probably needed in favor of this
change (or against, in which case the fate of the patch might be a
rejection).


I agree with Daniel that breaking an 8-year-old behavior may not be to
the taste of folks relying on the current behavior, particularly
because we have not had complaints about the current behavior being
bad.  So -1 from me.


Yeah, I agree. There are several reasons why it's done this way:

1) overhead

Now we only store two timestamps - one for the database and one for the
bgwriter. We could track a timestamp for each object, of course ...

2) complexity

Updating the timestamps would be fairly simple, but what about querying
the data? Currently you fetch the data, see if the stats_reset changed
since the last snapshot, and if not you're good. If it changed, you know
some object (or the whole db) has reset counters, so you can't rely on
the data being consistent.
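
In SQL terms, that check is roughly (sketch):

-- compare with the value captured at the previous snapshot
SELECT stats_reset
FROM pg_stat_database
WHERE datname = current_database();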

If we had stats_reset for each object, figuring out which data is still
valid and what has been reset would be far more complicated.

But resetting stats is not expected to be a common operation, so this
seemed like an acceptable tradeoff (and I'd argue it still is).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Detailed questions about pg_xact_commit_timestamp

2019-07-12 Thread Morris de Oryx
Adrien, thanks very much for answering my question. Just a couple of
follow-up points, if you don't mind.

In your answer, you offer an example of pg_xact_commit_timestamp showing
out-of-sequence commit times:

Session xid  pg_xact_commit_timestamp
A   34386826 2019-07-11 09:32:38.994440+00  Started earlier,
committed later
B   34386827 2019-07-11 09:32:29.806183+00

I may not have asked my question clearly, or I may not understand the
answer properly. Or both ;-) If I understand it correctly, an xid is
assigned when a transaction starts. One transaction might take a second,
another might take ten minutes. So, the xid sequence doesn't imply anything
at all about commit sequence. What I'm trying to figure out is if it is
possible for the commit timestamps to somehow be out of order. What I'm
looking for is a way of finding changes committed since a specific moment.
When the transaction started doesn't matter in my case.

Is pg_xact_commit_timestamp suitable for this? I'm getting the impression
that it isn't, but I don't quite understand how. And if it isn't suited to
this purpose, does anyone know what pg_xact_commit_timestamp is for? What
I'm after is something like an "xcommitserial" that increases reliably and
monotonically on transaction commit. That's how I'm hoping that
pg_xact_commit_timestamp functions.
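
For context, the kind of query I want to run is roughly this (my_table is
just a placeholder, and track_commit_timestamp must be on):

SELECT t.*, pg_xact_commit_timestamp(t.xmin) AS committed_at
FROM my_table t
WHERE pg_xact_commit_timestamp(t.xmin) > '2019-07-01 00:00:00+00';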

Thanks also for making me understand that pg_xact_commit_timestamp applies
to a *transaction*, not to each row. That makes it a lot lighter in the
database. I was thinking 12 bytes+ per row, which is completely off for my
case. (I tend to insert thousands of rows in a transaction.)

> Yes timestamp are stored in pg_commit_ts directory. Old timestamp are
removed after freeze has explained in
> https://www.postgresql.org/docs/current/routine-vacuuming.html

Thanks for the answer, and for kindly pointing me to the right section of
the documentation. It's easy to get impatient with new(er) users. I'm _not_
lazy about reading manuals and researching but, well, the Postgres
documentation is over 3,000 pages long when you download it. So, I may have
missed a detail or two. If I read that correctly, the ~4 billion number
range is made into an endless circle by keeping ~2 billion numbers in the
past and ~2 billion in the future. If that's right, I'm never going to be
so out of date that the ~2 billion number window is too small.


Re: Request for improvement: Allow to push (t.o).id via GROUP BY ocd.o

2019-07-12 Thread Eugen Konkov
And, probably, the next query belongs to the same issue:

SELECT
--next_ots.group_cost AS next_cost,
(SELECT next_ots FROM order_total_suma( next_range ) next_ots 
WHERE next_ots.order_id = ots.order_id  AND next_ots.consumed_period @> 
(ots.o).billed_to
) AS next_suma,  -- << this takes 111ms only
ots.* FROM (
SELECT  
tstzrange(
  NULLIF( (ots.o).billed_to, 'infinity' ),
  NULLIF( (ots.o).billed_to +p.interval, 'infinity' )
) as next_range,
 ots.*
FROM order_total_suma() ots
LEFT JOIN period p ON p.id = (ots.o).period_id
) ots
--LEFT JOIN order_total_suma( next_range ) next_ots ON next_ots.order_id = 6154 
 --<< this is fine
--  AND next_ots.consumed_period @> (ots.o).billed_to 
--LEFT JOIN order_total_suma( next_range ) next_ots ON next_ots.order_id = 
ots.order_id  --<< this takes 11500ms
--  AND next_ots.consumed_period @> (ots.o).billed_to 
WHERE ots.order_id IN ( 6154, 10805 )


The id is not pushed down for the LEFT JOIN.


I have attached plans:



-- 
Best regards,
Eugen Konkov

plans
Description: Binary data


Add FOREIGN to ALTER TABLE in pg_dump

2019-07-12 Thread Luis Carril
Hello,
pg_dump creates plain ALTER TABLE statements even if the table is a foreign
table, which is confusing for someone reading the dump.
This also makes a difference when applying the dump if any plugin that hooks
on ProcessUtility is installed, because the plugin could react differently
to ALTER TABLE than to ALTER FOREIGN TABLE (see the example below).  Opinions?
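
For example, with a made-up table name, instead of

ALTER TABLE ONLY public.remote_orders ALTER COLUMN status SET DEFAULT 'new';

the dump would read

ALTER FOREIGN TABLE ONLY public.remote_orders ALTER COLUMN status SET DEFAULT 'new';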

An unrelated question: if I apply pgindent to a file (in this case
pg_dump.c) and get a bunch of indentation changes that are not related
to my patch, what is the accepted policy? A separate patch first with only
the indentation changes?  Or maybe I am using pgindent wrong?

Cheers
Luis M Carril
From a5b439b7dfdc1f9bab8d2cf1fc95a8eaedb1d83e Mon Sep 17 00:00:00 2001
From: Luis Carril 
Date: Fri, 12 Jul 2019 12:51:17 +0200
Subject: [PATCH] Add FOREIGN to ALTER statements

---
 src/bin/pg_dump/pg_dump.c | 40 ---
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 806fc78f04..72951f9a0a 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -15535,6 +15535,7 @@ dumpTableSchema(Archive *fout, TableInfo *tbinfo)
 	{
 		char	   *ftoptions = NULL;
 		char	   *srvname = NULL;
+		char	   *foreign = "";
 
 		switch (tbinfo->relkind)
 		{
@@ -15568,6 +15569,8 @@ dumpTableSchema(Archive *fout, TableInfo *tbinfo)
 	ftoptions = pg_strdup(PQgetvalue(res, 0, i_ftoptions));
 	PQclear(res);
 	destroyPQExpBuffer(query);
+
+	foreign = "FOREIGN";
 	break;
 }
 			case RELKIND_MATVIEW:
@@ -15909,7 +15912,7 @@ dumpTableSchema(Archive *fout, TableInfo *tbinfo)
 	continue;
 
 appendPQExpBufferStr(q, "\n-- For binary upgrade, set up inherited constraint.\n");
-appendPQExpBuffer(q, "ALTER TABLE ONLY %s ",
+appendPQExpBuffer(q, "ALTER %s TABLE ONLY %s ", foreign,
   qualrelname);
 appendPQExpBuffer(q, " ADD CONSTRAINT %s ",
   fmtId(constr->dobj.name));
@@ -15930,7 +15933,7 @@ dumpTableSchema(Archive *fout, TableInfo *tbinfo)
 {
 	TableInfo  *parentRel = parents[k];
 
-	appendPQExpBuffer(q, "ALTER TABLE ONLY %s INHERIT %s;\n",
+	appendPQExpBuffer(q, "ALTER %s TABLE ONLY %s INHERIT %s;\n", foreign,
 	  qualrelname,
 	  fmtQualifiedDumpable(parentRel));
 }
@@ -16036,7 +16039,7 @@ dumpTableSchema(Archive *fout, TableInfo *tbinfo)
 			if (!shouldPrintColumn(dopt, tbinfo, j) &&
 tbinfo->notnull[j] && !tbinfo->inhNotNull[j])
 			{
-appendPQExpBuffer(q, "ALTER TABLE ONLY %s ",
+appendPQExpBuffer(q, "ALTER %s TABLE ONLY %s ", foreign,
   qualrelname);
 appendPQExpBuffer(q, "ALTER COLUMN %s SET NOT NULL;\n",
   fmtId(tbinfo->attnames[j]));
@@ -16049,7 +16052,7 @@ dumpTableSchema(Archive *fout, TableInfo *tbinfo)
 			 */
 			if (tbinfo->attstattarget[j] >= 0)
 			{
-appendPQExpBuffer(q, "ALTER TABLE ONLY %s ",
+appendPQExpBuffer(q, "ALTER %s TABLE ONLY %s ", foreign,
   qualrelname);
 appendPQExpBuffer(q, "ALTER COLUMN %s ",
   fmtId(tbinfo->attnames[j]));
@@ -16086,7 +16089,7 @@ dumpTableSchema(Archive *fout, TableInfo *tbinfo)
  */
 if (storage != NULL)
 {
-	appendPQExpBuffer(q, "ALTER TABLE ONLY %s ",
+	appendPQExpBuffer(q, "ALTER %s TABLE ONLY %s ", foreign,
 	  qualrelname);
 	appendPQExpBuffer(q, "ALTER COLUMN %s ",
 	  fmtId(tbinfo->attnames[j]));
@@ -16100,7 +16103,7 @@ dumpTableSchema(Archive *fout, TableInfo *tbinfo)
 			 */
 			if (tbinfo->attoptions[j][0] != '\0')
 			{
-appendPQExpBuffer(q, "ALTER TABLE ONLY %s ",
+appendPQExpBuffer(q, "ALTER %s TABLE ONLY %s ", foreign,
   qualrelname);
 appendPQExpBuffer(q, "ALTER COLUMN %s ",
   fmtId(tbinfo->attnames[j]));
@@ -16223,6 +16226,7 @@ dumpAttrDef(Archive *fout, AttrDefInfo *adinfo)
 	PQExpBuffer delq;
 	char	   *qualrelname;
 	char	   *tag;
+	char	   *foreign;
 
 	/* Skip if table definition not to be dumped */
 	if (!tbinfo->dobj.dump || dopt->dataOnly)
@@ -16237,13 +16241,15 @@ dumpAttrDef(Archive *fout, AttrDefInfo *adinfo)
 
 	qualrelname = pg_strdup(fmtQualifiedDumpable(tbinfo));
 
-	appendPQExpBuffer(q, "ALTER TABLE ONLY %s ",
+	foreign = tbinfo->relkind == RELKIND_FOREIGN_TABLE ? "FOREIGN" : "";
+
+	appendPQExpBuffer(q, "ALTER %s TABLE ONLY %s ", foreign,
 	  qualrelname);
 	appendPQExpBuffer(q, "ALTER COLUMN %s SET DEFAULT %s;\n",
 	  fmtId(tbinfo->attnames[adnum - 1]),
 	  adinfo->adef_expr);
 
-	appendPQExpBuffer(delq, "ALTER TABLE %s ",
+	appendPQExpBuffer(delq, "ALTER %s TABLE %s ", foreign,
 	  qualrelname);
 	appendPQExpBuffer(delq, "ALTER COLUMN %s DROP DEFAULT;\n",
 	  fmtId(tbinfo->attnames[adnum - 1]));
@@ -16536,6 +16542,7 @@ dumpConstraint(Archive *fout, ConstraintInfo *coninfo)
 	PQExpBuffer q;
 	PQExpBuffer delq;
 	char	   *tag = NULL;
+	char	   *foreign = NULL;
 
 	/* Skip if not to be dumped */
 	if (!coninfo->dobj.dump || dopt->data

Re: Check-out mutable functions in check constraints

2019-07-12 Thread Pavel Stehule
pá 12. 7. 2019 v 13:11 odesílatel Tomas Vondra 
napsal:

> On Fri, Jul 12, 2019 at 08:55:20AM +0200, Pavel Stehule wrote:
> >Hi
> >
> >pá 12. 7. 2019 v 8:45 odesílatel Kyotaro Horiguchi <
> horikyota@gmail.com>
> >napsal:
> >
> >> Hello.
> >>
> >> As mentioned in the following message:
> >>
> >>
> >>
> https://www.postgresql.org/message-id/20190712.150527.145133646.horikyota.ntt%40gmail.com
> >>
> >> Mutable function are allowed in check constraint expressions but
> >> it is not right. The attached is a proposed fix for it including
> >> regression test.
> >>
> >> Other "constraints vs " checks do not seem to be exercised
> >> but it would be another issue.
> >>
> >
> >I think so this feature (although is correct) can breaks almost all
> >applications - it is 20 year late.
> >
>
> I'm not sure it actually breaks such appliations.
>
> Let's assume you have a mutable function (i.e. it may change return value
> even with the same parameters) and you use it in a CHECK constraint. Then
> I'm pretty sure your application is already broken in various ways and you
> just don't know it (sometimes it subtle, sometimes less so).
>

Years ago, SQL functions were used for checks instead of triggers - I am not
sure whether this pattern was in the documentation or not, but there was
surely no warning against it.

You can find examples like this in some documents:

CREATE OR REPLACE FUNCTION check_func(int)
RETURNS boolean AS $$
SELECT EXISTS (SELECT 1 FROM tab WHERE id = $1);
$$ LANGUAGE sql;

CREATE TABLE foo( ... id CHECK(check_func(id)));





> If you have a function that actually is immutable and it's just not marked
> accordingly, then that only requires a single DDL to fix that during
> upgrade. I don't think that's a massive issue.
>

These functions are stable, and this patch tries to prohibit them.
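
For example, if I read the patch right, even a constraint using a stable
built-in would start to be rejected (sketch):

CREATE TABLE events (
    created timestamptz CHECK (created <= now())  -- accepted today, rejected with the patch
);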

Regards

Pavel

>
> That being said, I don't know whether fixing this is worth the hassle.
>
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-12 Thread Sehrope Sarkuni
On Thu, Jul 11, 2019 at 9:05 PM Bruce Momjian  wrote:
>
> On Thu, Jul 11, 2019 at 08:41:52PM -0400, Joe Conway wrote:
> > I vote for AES 256 rather than 128.
>
> Why?  This page seems to think 128 is sufficient:
>
> 
> https://crypto.stackexchange.com/questions/20/what-are-the-practical-differences-between-256-bit-192-bit-and-128-bit-aes-enc
>
> For practical purposes, 128-bit keys are sufficient to ensure 
> security.
> The larger key sizes exist mostly to satisfy some US military
> regulations which call for the existence of several distinct "security
> levels", regardless of whether breaking the lowest level is already 
> far
> beyond existing technology.
>
> We might need to run some benchmarks to determine the overhead of going
> to AES256, because I am unclear of the security value.

If the algorithm and key size are not going to be configurable, then it is
better to lean toward the larger size, especially given the desire for
future-proofing against standards evolution and the potential for the
encrypted data to be very long lived. NIST recommends AES-128 or
higher, but there are other publications that recommend AES-256 for
long-term usage:

NIST - 2019: Recommends AES-128, AES-192, or AES-256.
https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-131Ar2.pdf

NSA - 2016: Recommends AES-256 for future quantum resistance.
https://apps.nsa.gov/iaarchive/library/ia-guidance/ia-solutions-for-classified/algorithm-guidance/cnsa-suite-and-quantum-computing-faq.cfm

ECRYPT - 2015: Recommends AES-256 for future quantum resistance.
https://www.ecrypt.eu.org/csa/documents/PQC-whitepaper.pdf

ECRYPT - 2018: Recommends AES-256 for long-term use.
https://www.ecrypt.eu.org/csa/documents/D5.4-FinalAlgKeySizeProt.pdf

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/




Re: Check-out mutable functions in check constraints

2019-07-12 Thread Tomas Vondra

On Fri, Jul 12, 2019 at 03:44:58PM +0900, Kyotaro Horiguchi wrote:

Hello.

As mentioned in the following message:

https://www.postgresql.org/message-id/20190712.150527.145133646.horikyota.ntt%40gmail.com

Mutable function are allowed in check constraint expressions but
it is not right. The attached is a proposed fix for it including
regression test.



I think the comment in parse_expr.c is wrong:

    /*
     * All SQL value functions are stable so we reject them in check
     * constraint expressions.
     */
    if (pstate->p_expr_kind == EXPR_KIND_CHECK_CONSTRAINT)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
                 errmsg("mutable functions are not allowed in check constraints")));

At first it claims SQL value functions are stable, but then rejects them
with a message that they're mutable.

Also, the other places use "cannot ..." messages:

   case EXPR_KIND_COLUMN_DEFAULT:
   err = _("cannot use column reference in DEFAULT expression");
   break;

so maybe these new checks should use the same style.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: Check-out mutable functions in check constraints

2019-07-12 Thread Tomas Vondra

On Fri, Jul 12, 2019 at 08:55:20AM +0200, Pavel Stehule wrote:

Hi

pá 12. 7. 2019 v 8:45 odesílatel Kyotaro Horiguchi 
napsal:


Hello.

As mentioned in the following message:


https://www.postgresql.org/message-id/20190712.150527.145133646.horikyota.ntt%40gmail.com

Mutable function are allowed in check constraint expressions but
it is not right. The attached is a proposed fix for it including
regression test.

Other "constraints vs " checks do not seem to be exercised
but it would be another issue.



I think so this feature (although is correct) can breaks almost all
applications - it is 20 year late.



I'm not sure it actually breaks such applications.

Let's assume you have a mutable function (i.e. it may change its return value
even with the same parameters) and you use it in a CHECK constraint. Then
I'm pretty sure your application is already broken in various ways and you
just don't know it (sometimes it's subtle, sometimes less so).

If you have a function that actually is immutable and it's just not marked
accordingly, then that only requires a single DDL to fix that during
upgrade. I don't think that's a massive issue.

That being said, I don't know whether fixing this is worth the hassle.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: idea: log_statement_sample_rate - bottom limit for sampling

2019-07-12 Thread Adrien Nayrat
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Sorry, I forgot to tick "passed" boxes.

Re: idea: log_statement_sample_rate - bottom limit for sampling

2019-07-12 Thread Adrien Nayrat
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   tested, failed
Spec compliant:   tested, failed
Documentation:tested, failed

I test the latest patch attached to this thread 
(log_statement_sample_limit-3.patch). Everything looks good to me.

The new status of this patch is: Ready for Committer


Re: Request for improvement: Allow to push (t.o).id via GROUP BY ocd.o

2019-07-12 Thread Eugen Konkov
Hello

To my mind it may be done, because `id` is the primary key of the `o` table.

Friday, July 12, 2019, 1:04:27 PM, you wrote:

> Hello PostgreSQL-development,

> something's stopping the planner from being able to deduce that
> (t.o).id is safe to push through a GROUP BY ocd.o

> SELECT * FROM (
> SELECT
> sum( t.group_suma ) OVER( PARTITION BY t.id ) 
> AS total_suma,
> --  sum( t.group_suma ) OVER( PARTITION
> BY (t.o).id ) AS total_suma,  -- For any WHERE this takes 2700ms
> *
> FROM (
> SELECT
>  sum( ocd.item_cost ) AS group_cost,
>  sum( ocd.item_suma ) AS group_suma,
>  max( (ocd.ic).consumed ) AS consumed,
>  (ocd.ic).consumed_period,
>  ocd.o,
>  (ocd.o).id
> FROM order_cost_details( tstzrange(
> '2019-04-01', '2019-05-01' ) ) ocd
> GROUP BY ocd.o, (ocd.o).id, 
> (ocd.ic).consumed_period
> ) t
> ) t
> WHERE t.id = 6154 AND t.consumed_period @>
> '2019-04-01'::timestamptz   -- This takes 2ms
> --  WHERE (t.o).id = 6154 AND t.consumed_period @>
> '2019-04-01'::timestamptz   -- This takes 2700ms


> More info is here: https://stackoverflow.com/q/57003113/4632019




-- 
Best regards,
Eugen Konkov





Request for improvement: Allow to push (t.o).id via GROUP BY ocd.o

2019-07-12 Thread Eugen Konkov
Hello PostgreSQL-development,

something's stopping the planner from being able to deduce that (t.o).id is 
safe to push through a GROUP BY ocd.o

SELECT * FROM (
SELECT
sum( t.group_suma ) OVER( PARTITION BY t.id ) 
AS total_suma,
--  sum( t.group_suma ) OVER( PARTITION BY (t.o).id 
) AS total_suma,  -- For any WHERE this takes 2700ms
*
FROM (
SELECT
 sum( ocd.item_cost ) AS group_cost,
 sum( ocd.item_suma ) AS group_suma,
 max( (ocd.ic).consumed ) AS consumed,
 (ocd.ic).consumed_period,
 ocd.o,
 (ocd.o).id
FROM order_cost_details( tstzrange( 
'2019-04-01', '2019-05-01' ) ) ocd
GROUP BY ocd.o, (ocd.o).id, 
(ocd.ic).consumed_period
) t
) t
WHERE t.id = 6154 AND t.consumed_period @> '2019-04-01'::timestamptz
   -- This takes 2ms
--  WHERE (t.o).id = 6154 AND t.consumed_period @> 
'2019-04-01'::timestamptz   -- This takes 2700ms


More info is here: https://stackoverflow.com/q/57003113/4632019

-- 
Best regards,
Eugen Konkov





Re: Add parallelism and glibc dependent only options to reindexdb

2019-07-12 Thread Julien Rouhaud
On Fri, Jul 12, 2019 at 7:57 AM Michael Paquier  wrote:
>
> On Fri, Jul 12, 2019 at 07:49:13AM +0200, Julien Rouhaud wrote:
> > It shouldn't be a problem, I reused the same infrastructure as for
> > vacuumdb.  so run_reindex_command has a new "async" parameter, so when
> > there's no parallelism it's using executeMaintenanceCommand (instead
> > of PQsendQuery) which will block until query completion.  That's why
> > there's no isFree usage at all in this case.
>
> My point is more about consistency and simplification with the case
> where n > 1 and that we could actually move the async/sync code paths
> into the same banner as the async mode waits as well until a slot is
> free, or in short when the query completes.

I attach v4 with all previous comments addressed.

I also changed it to handle the parallel and non-parallel cases the same way.
I kept the possibility of synchronous behavior in reindexdb, as
there's an early need to run some queries in the case of a parallel
database-wide reindex.  This avoids opening all the connections in case
anything fails during this preliminary work, and it also avoids
another call to the async wait function.  If we add parallelism to
clusterdb (I'll probably work on that next time I have spare time),
reindexdb would be the only caller left of
executeMaintenanceCommand(), so that's something we may want to
change.

I didn't change the behavior wrt. the possible deadlock if the user specifies
catalog objects using --index or --table and asks for multiple
connections, as I'm afraid it would add too much code for little
benefit.  Please shout if you think otherwise.


0001-Export-vacuumdb-s-parallel-infrastructure-v4.patch
Description: Binary data


0002-Add-parallel-processing-to-reindexdb-v4.patch
Description: Binary data


RE: Copy data to DSA area

2019-07-12 Thread Ideriha, Takeshi
Hi, thank you for the previous two emails. 

Thomas Munro  wrote:
>What do you think about the following?  Even though I know you want to start 
>with
>much simpler kinds of cache, I'm looking ahead to the lofty end-goal of having 
>a shared
>plan cache.  No doubt, that involves solving many other problems that don't 
>belong in
>this thread, but please indulge me:

My initial motivation came from a shared catcache and relcache, but I also think
a shared plan cache is one of the big topics, and I'd be very excited if it came
true.  Sometimes making a plan in each backend becomes an enormous overhead for
speed.

>On Wed, Jul 10, 2019 at 6:03 PM Thomas Munro  wrote:
>> Hmm.  I wonder if we should just make ShmContextFree() do nothing! And
>> make ShmContextAlloc() allocate (say) 8KB chunks (or larger if needed
>> for larger allocation) and then hand out small pieces from the
>> 'current' chunk as needed.  Then the only way to free memory is to
>> destroy contexts, but for the use case being discussed, that might
>> actually be OK.  I suppose you'd want to call this implementation
>> something different, like ShmRegionContext, ShmZoneContext or
>> ShmArenaContext[1].
>
>
>
>I guess what I said above is only really appropriate for complex things like 
>plans that
>have their own contexts so that we can delete them easily "in bulk".  I guess 
>it's not
>true for caches of simpler objects like catcache, that don't want a context 
>for each
>cached thing and want to free objects "retail" (one by one).  So I guess you 
>might
>want something more like your current patch for (say) SharedCatCache, and
>something like the above-quoted idea for (say) SharedPlanCache or 
>SharedRelCache.

Here's what I get:
- The temporary/permanent concept can be implemented by following the current
  cached-plan scheme of reparenting and also adding a general API,
  MemoryContextClone.

- There are at least two allocation strategies (implementation ways). One is
  ShmZoneContext: region-based allocation, which drops chunks in bulk. This type
  is useful for a cache made of a complex object graph, like the plan cache. In
  this case the clean-up list (the list of pointers to dsa_allocated objects)
  needs to be located in shared memory, because it is used not only in the error
  case but also in DropCachedPlan.

- The other is not region-based, but allocates and frees objects "retail". I
  would call it the ShmSimpleContext case. This would be suitable for, say, the
  catalog cache.

For the ShmSimpleContext type, it is OK to have the clean-up list in local
memory, right?  Unlike ShmZoneContext, this list is only used on error, and a
cache entry can be freed by just traversing the simple object graph, as
discussed before.

Regarding the tracking list for cleanup, I'd change the design based on either
ShmSimpleContext or ShmZoneContext, assuming the clean-up list is pointed to by
the context object in both types.

In the ShmSimpleContext type, the list is deleted after the context is
reparented to a long-lived one. The context for building, say, a catcache entry
is located in shared memory but its tracking list is in local memory.
So it seems weird that a shared object points to a local one, but there
seems to be no problem because other processes cannot touch the context until
it is reparented.

In the case of ShmZoneContext, it would be OK to still keep the list after its
reparenting, because both this intermediate context and the list are
supposed to be located in shared memory and used there.

Regarding the intermediate-level context, I was wondering whether we need it in
the ShmSimpleContext case. Every time we build a catcache entry, we make a
MemoryContext object in shared memory, and this becomes some memory
overhead. But right now a context object per catcache entry seems like a simple
architecture, and moreover it is easier to handle in the error case.


>For an implementation that supports retail free, perhaps you could store the 
>address of
>the clean-up list element in some extra bytes before the returned pointer, so 
>you don't
>have to find it by linear search.  Next, I suppose you don't want to leave 
>holes in the
>middle of the array, so perhaps instead of writing NULL there, you could 
>transfer the
>last item in the array to this location (with associated concurrency problems).

Sure, I'll fix it.

>Since I don't think anyone ever said it explicitly, the above discussion is 
>all about how
>we get to this situation, while making sure that we're mostly solving problems 
>that
>occur in both multi-process and multi-threaded designs:
>
>shared_metadata_cache = '100MB'
>shared_plan_cache = '100MB'

Yes, we are talking about this. Thank you for the clarification.

Regards,
Takeshi Ideriha



Re: POC: Cleaning up orphaned files using undo logs

2019-07-12 Thread Amit Kapila
On Thu, Jul 11, 2019 at 1:29 AM Robert Haas  wrote:
>
> After some off-list discussion with Andres ...
>
> Another possible approach here, which I think I like better, is to
> switch from using a binary heap to using an rbtree.  That wouldn't
> work well in DSM because of the way it uses pointers, but here we're
> putting data in the shared memory segment so it seems like it should
> work.  The idea would be to allocate an array of entries with a
> freelist, and then have allocfunc and freefunc defined to push and pop
> the freelist.  Unlike a binary heap, an rbtree lets us (a) do
> peek-ahead in sorted order and (b) delete elements from an arbitrary
> position without rebuilding anything.
>
> If we adopt this approach, then I think a bunch of the problems we've
> been talking about actually get a lot easier.  If we pull an item from
> the ordered-by-XID rbtree or the ordered-by-undo-size rbtree, we can
> remove it from the other one cheaply, because we can store a pointer
> to the RBTNode in the main object.  So then we never have any stale
> pointers in any data structure, which means we don't have to have a
> strategy to avoid accidentally following them.
>
> The fact that we can peak-ahead correctly without any new code is also
> very nice.  I'm still concerned that peeking ahead isn't the right
> approach in general, but if we're going to do it, peeking ahead to the
> actually-next-highest-priority item is a lot better than peeking ahead
> to some-item-that-may-be-fairly-high-priority.
>
> One problem which Andres spotted is that rbt_delete() can actually
> move content around, so if you just cache the RBTNode returned by
> rbt_insert(), it might not be the right one by the time you
> rbt_delete(), if other stuff has been deleted first.  There are
> several possible approaches to that problem, but one that I'm
> wondering about is modifying rbt_delete_node() so that it doesn't rely
> on rbt_copy_data.  The idea is that if y != z, instead of copying the
> data from y to z, copy the left/right/parent pointers from z into y,
> and make z's left, right, and parent nodes point to y instead.
>

I am not sure but don't we need to retain the color of z as well?

Apart from this, the handling of duplicate keys (e.g., for size queues the
sizes of two requests can be the same) might need some work.  Basically,
either a special combine function needs to be written (I'm not sure yet what
we should do there) or we always need to ensure that the key is unique,
like (size + start_urec_ptr).  If the size is the same, then we can
decide based on start_urec_ptr.

I think we can go by changing the implementation to rbtree by doing
some enhancements instead of the binary heap or alternatively, we can
use one of the two ideas suggested by you in the email above [1] to
simplify the code and keep using the binary heap for now.  Especially,
I like the below one.
"2. However, I don't think we should have a separate request object
for each queue anyway. We should insert pointers to the same objects
in all the relevant queue (either size + XID, or else error). So
instead of having three sets of objects, one for each queue, we'd just
have one set of objects and point to them with as many as two
pointers.
We'd therefore need LESS memory than we're using today, because we
wouldn't have separate arrays for XID, size, and error queue
elements."

I think even if we currently go with a binary heap, it will be
possible to change it to rbtree later, but I am fine either way.

[1] - 
https://www.postgresql.org/message-id/CA%2BTgmoZ5g7UzMvM_42YMG8nbhOYpH%2Bu5OMMnePJkYtT5HWotUw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Minimal logical decoding on standbys

2019-07-12 Thread tushar

On 07/10/2019 05:12 PM, Amit Khandekar wrote:

All right. Will do that in the next patch set. For now, I have quickly
done the below changes in a single patch again (attached), in order to
get early comments if any.
Thanks Amit for your patch. I am able to see an issue on the standby server
(where the logical replication slot was created):
a) the size of the pg_wal folder is NOT decreasing even after firing the
get_changes function
b) pg_wal files are not being recycled, and new files keep being created
after firing the get_changes function


Here are the detailed steps -
create a directory with the name 'archive_dir' under /tmp (mkdir 
/tmp/archive_dir)

SR setup -
Master
.)Perform initdb  (./initdb -D master --wal-segsize=2)
.)Open postgresql.conf file and add these parameters at the end
of the file

wal_level='logical'
min_wal_size=4MB
max_wal_size=4MB
hot_standby_feedback = on
archive_mode=on
archive_command='cp %p /tmp/archive_dir/%f'
.)Start the server ( ./pg_ctl -D master/ start -l logsM -c )
.)Connect to psql , create physical  slot
    ->SELECT * FROM 
pg_create_physical_replication_slot('decoding_standby');

Standby -
.)Perform pg_basebackup ( ./pg_basebackup -D standby/ 
--slot=decoding_standby -R -v)
.)Open postgresql.conf file of standby and add these 2 parameters - at 
the end of file

  port=
primary_slot_name = 'decoding_standby'
.)Start the Standby server  ( ./pg_ctl -D standby/ start -l logsS -c )
.)Connect to psql terminal and create logical replication slot
  ->SELECT * from   pg_create_logical_replication_slot('standby', 
'test_decoding');


MISC steps -
.)Connect to master and create table/insert rows ( create table t(n int);
insert into t values (1); )
.)Connect to standby and fire get_changes function ( select * from 
pg_logical_slot_get_changes('standby',null,null); )

.)Run pgbench ( ./pgbench -i -s 10 postgres)
.)Check the pg_wal directory size of  STANDBY
[centos@mail-arts bin]$ du -sch standby/pg_wal/
127M    standby/pg_wal/
127M    total
[centos@mail-arts bin]$

.)Connect to standby and fire get_changes function ( select * from 
pg_logical_slot_get_changes('standby',null,null); )

.)Check the pg_wal directory size of STANDBY
[centos@mail-arts bin]$ du -sch standby/pg_wal/
127M    standby/pg_wal/
127M    total
[centos@mail-arts bin]$


.)Restart both master and standby ( ./pg_ctl -D master restart -l logsM 
-c) and (./pg_ctl -D standby restart -l logsS -c )


.)Check the pg_wal directory size of STANDBY
[centos@mail-arts bin]$ du -sch standby/pg_wal/
127M    standby/pg_wal/
127M    total
[centos@mail-arts bin]$

and if we look at the pg_wal files, the directory keeps growing and the
files are not being reused.
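
A quick way to see whether the slot's restart_lsn is advancing after
pg_logical_slot_get_changes() (query mine):

SELECT slot_name, slot_type, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;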

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company



Re: [HACKERS] WAL logging problem in 9.4.3?

2019-07-12 Thread Kyotaro Horiguchi
At Fri, 12 Jul 2019 17:30:41 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
 wrote in 
<20190712.173041.236938840.horikyota@gmail.com>
> The v16 seems no longer works so I'll send further rebased version.

It was just due to the renaming of TestLib::real_dir to perl2host.
This is rebased version v17.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 9bcd4acb14c5cef2d4bdf20c9be8c86597a9cf7c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/3] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 291 
 1 file changed, 291 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 00..b26cd8efd5
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt here, and should never result in any type of failures or
+# data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+	my($node, $test_name) = @_;
+
+	my $db_oid = $node->safe_psql('postgres',
+	   "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+	my $prefix = "base/$db_oid/";
+	my $filepaths_referenced = $node->safe_psql('postgres', "
+	   SELECT pg_relation_filepath(oid) FROM pg_class
+	   WHERE reltablespace = 0 and relpersistence <> 't' and
+	   pg_relation_filepath(oid) IS NOT NULL;");
+	is_deeply([sort(map { "$prefix$_" }
+	grep(/^[0-9]+$/,
+		 slurp_dir($node->data_dir . "/$prefix")))],
+			  [sort split /\n/, $filepaths_referenced],
+			  $test_name);
+	return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+	my $wal_level = shift;
+
+	# Primary needs to have wal_level = minimal here
+	my $node = get_new_node("node_$wal_level");
+	$node->init;
+	$node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+	$node->start;
+
+	# Setup
+	my $tablespace_dir = $node->basedir . '/tablespace_other';
+	mkdir ($tablespace_dir);
+	$tablespace_dir = TestLib::perl2host($tablespace_dir);
+	$node->safe_psql('postgres',
+	   "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+	# Test direct truncation optimization.  No tuples
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test1 (id serial PRIMARY KEY);
+		TRUNCATE test1;
+		COMMIT;");
+
+	$node->stop('immediate');
+	$node->start;
+
+	my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+	is($result, qq(0),
+	   "wal_level = $wal_level, optimized truncation with empty table");
+
+	# Test truncation with inserted tuples within the same transaction.
+	# Tuples inserted after the truncation should be seen.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test2 (id serial PRIMARY KEY);
+		INSERT INTO test2 VALUES (DEFAULT);
+		TRUNCATE test2;
+		INSERT INTO test2 VALUES (DEFAULT);
+		COMMIT;");
+
+	$node->stop('immediate');
+	$node->start;
+
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+	is($result, qq(1),
+	   "wal_level = $wal_level, optimized truncation with inserted table");
+
+	# Data file for COPY query in follow-up tests.
+	my $basedir = $node->basedir;
+	my $copy_file = "$basedir/copy_data.txt";
+	TestLib::append_to_file($copy_file, qq(2,3
+20001,30001
+20002,30002));
+
+	# Test truncation with inserted tuples using COPY.  Tuples copied after the
+	# truncation should be seen.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+		TRUNCATE test3;
+		COPY test3 FROM '$copy_file' DELIMITER ',';
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+	is($result, qq(3),
+	   "wal_level = $wal_level, optimized truncation with copied table");
+
+	# Like previous test, but rollback SET TABLESPACE in a subtransaction.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+		TRUNCATE test3a;
+		SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+		COPY test3a FROM '$copy_file' DELIMITER ',';
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+	is($result, qq(3),
+	   "wal_level = $wal_level, SET TABLESPACE in subtran

Re: [HACKERS] WAL logging problem in 9.4.3?

2019-07-12 Thread Kyotaro Horiguchi
Many messages seem to have been lost during the move to the new environment.
I'm digging through the archive but couldn't find the message for v15.

At Thu, 11 Jul 2019 18:03:35 -0700, Noah Misch  wrote in 
<20190712010335.gb1610...@rfd.leadboat.com>
> On Wed, Jul 10, 2019 at 01:19:14PM +0900, Kyotaro Horiguchi wrote:
> > Hello. Rebased the patch to master(bd56cd75d2).
> 
> It looks like you did more than just a rebase, because this v16 no longer
> modifies many files that v14 did modify.  (That's probably good, since you had
> pending review comments.)  What other changes did you make?

Yeah.. Maybe I forgot to send pre-v15 or v16 before rebasing.

v14: WAL-logging is controlled by AMs, and syncing at commit is
controlled according to that behavior.  At-commit sync is still
controlled on a per-relation basis, which means it must be
processed before the transaction state becomes TRANS_COMMIT. So
it needs to be separated out of AtEOXact_RelationCache() into
PreCommit_RelationSync().

v15: The biggest change is that at-commit sync is changed to an smgr
   basis. At-commit sync is programmed at creation of a storage
   file (RelationCreateStorage), and smgrDoPendingDelete (or
   smgrDoPendingOperations after the rename) runs the syncs.  AMs are
   no longer involved, and all permanent relations are WAL-skipped
   entirely in the creation transaction while wal_level=minimal.

   All storage files created for a relation are synced once and then
   removed at commit.

v16: rebased.

The v16 patch no longer seems to work, so I'll send a further rebased version.

Sorry for the late reply and confusion..

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [RFC] [PATCH] Flexible "partition pruning" hook

2019-07-12 Thread Amit Langote
Sorry for jumping into this thread pretty late.  I have read the
messages on this thread a couple of times and...

On Fri, Jul 12, 2019 at 3:05 PM Kyotaro Horiguchi
 wrote:
> I'm on Peter's side. Unlike other similar hooks, this hook is
> provided just to introduce arbitrary inconsistency in partition
> pruning.

...I have questions about the patch as proposed, such as those Peter
and Horiguchi-san might have.

Why does this hook need to be specific to partitioning, that is, why
is this a "partition pruning" hook? IOW, why does the functionality of
this hook apply only to partitions as opposed to *any* table?  I may
be missing something, but the two things -- a table's partition
constraint and security properties (or any properties that the
proposed hook's suppliers will provide) -- seem unrelated, so making
this specific to partitioning seems odd.  If this was proposed as
applying to any table, then it might be easier to choose the point
from which hook will be called and there will be fewer such points to
consider.  Needless to say, since the planner sees partitions as
tables, the partitioning case will be covered too.

Looking at the patch, it seems that the hook will be called *after*
opening the partition (provided it survived partition pruning); I'm
looking at this hunk:

@@ -1038,6 +1038,13 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
  continue;
  }

+ if (!InvokePartitionPruneHook(childRTE->relid))
+ {
+ /* Implement custom partition pruning filter*/
+ set_dummy_rel_pathlist(childrel);
+ continue;
+ }

set_append_rel_size() is called *after* partitions are opened and
their RelOptInfos formed.  So it's not following the design you
intended whereby the hook will avoid uselessly opening partition
files.

Also, I suspect you don't want to call the hook from here:

@@ -92,6 +93,10 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
  while ((inheritsTuple = systable_getnext(scan)) != NULL)
  {
  inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
+
+ if (!InvokePartitionPruneHook(inhrelid))
+ continue;
+

...as you might see unintended consequences, because
find_inheritance_children() is called from many places that wouldn't
want arbitrary extension code to drop some children from the list. For
example, it is called when adding a column to a parent table to find
the children that will need to get the same column added.  You
wouldn't want some children getting the column added while others not.

Thanks,
Amit




Re: Comment fix of config_default.pl

2019-07-12 Thread Michael Paquier
On Fri, Jul 12, 2019 at 03:34:11PM +0900, Kyotaro Horiguchi wrote:
> At Fri, 12 Jul 2019 13:01:13 +0900, Michael Paquier  
> wrote in <20190712040113.gd2...@paquier.xyz>
>> --with-ossp-uuid is an obsolete spelling.  Wouldn't it be better to
>> replace it with --with-uuid=?  That would be a bit inconsistent
> 
> Oops! Right. My eyes slipped over the difference..

I would also patch GetFakeConfigure in Solution.pm (no need to send a
new patch), and I thought that you'd actually do the change.  What do
you think?
--
Michael


signature.asc
Description: PGP signature


Re: benchmarking Flex practices

2019-07-12 Thread John Naylor
On Wed, Jul 10, 2019 at 3:15 AM Tom Lane  wrote:
>
> John Naylor  writes:
> > [ v4 patches for trimming lexer table size ]
>
> I reviewed this and it looks pretty solid.  One gripe I have is
> that I think it's best to limit backup-prevention tokens such as
> quotecontinuefail so that they match only exact prefixes of their
> "success" tokens.  This seems clearer to me, and in at least some cases
> it can save a few flex states.  The attached v5 patch does it like that
> and gets us down to 22331 states (from 23696).  In some places it looks
> like you did that to avoid writing an explicit "{other}" match rule for
> an exclusive state, but I think it's better for readability and
> separation of concerns to go ahead and have those explicit rules
> (and it seems to make no difference table-size-wise).

Looks good to me.

> We still need to propagate these changes into the psql and ecpg lexers,
> but I assume you were waiting to agree on the core patch before touching
> those.  If you're good with the changes I made here, have at it.

I just made a couple additional cosmetic adjustments that made sense
when diff'ing with the other scanners. Make check-world passes. Some
notes:

The pre-existing ecpg var "state_before" was a bit confusing when
combined with the new var "state_before_quote_stop", and the former is
also used with C-comments, so I decided to go with
"state_before_lit_start" and "state_before_lit_stop". Even though
comments aren't literals, it's less of a stretch than referring to
quotes. To keep things consistent, I went with the latter var in psql
and core.

To get the regression tests to pass, I had to add this:

 psql_scan_in_quote(PsqlScanState state)
 {
- return state->start_state != INITIAL;
+ return state->start_state != INITIAL &&
+ state->start_state != xqs;
 }

...otherwise with parens we sometimes don't get the right prompt and
we get empty lines echoed. Adding xuend and xuchar here didn't seem to
make a difference. There might be something subtle I'm missing, so I
thought I'd mention it.

With the unicode escape rules brought over, the diff to the ecpg
scanner is much cleaner now. The diff for the C-comment rules was still
pretty messy in comparison, so I made an attempt to clean that up in
0002. A bit off-topic, but I thought I should offer that while it was
fresh in my head.

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


v6-0001-Reduce-the-number-of-states-in-the-core-scanner-t.patch
Description: Binary data


v6-0002-Merge-ECPG-scanner-states-for-C-style-comments.patch
Description: Binary data


Re: Remove page-read callback from XLogReaderState.

2019-07-12 Thread Kyotaro Horiguchi
At Tue, 28 May 2019 04:45:24 -0700, Andres Freund  wrote in 
<20190528114524.dvj6ymap2virl...@alap3.anarazel.de>
> Hi,
> 
> On 2019-04-18 21:02:57 +0900, Kyotaro HORIGUCHI wrote:
> > Hello. As mentioned before [1], read_page callback in
> > XLogReaderState is a cause of headaches. Adding another
> > remote-controlling stuff to xlog readers makes things messier [2].
> > 
> > I refactored XLOG reading functions so that we don't need the
> > callback. In short, ReadRecord now calls XLogPageRead directly
> > with the attached patch set.
> > 
> > | while (XLogReadRecord(xlogreader, RecPtr, &record, &errormsg)
> > |        == XLREAD_NEED_DATA)
> > |     XLogPageRead(xlogreader, fetching_ckpt, emode, randAccess);
> > 
> > On the other hand, XLogReadRecord became a bit complex. The patch
> > makes XLogReadRecord a state machine. I'm not confident that the
> > additional complexity is worth doing. Anyway I'll register this
> > to the next CF.
> 
> Just FYI, to me this doesn't clearly enough look like an improvement,
> for a change of this size.

Thanks for the opinion. I kinda agree about size, but it is a
decision between "having multiple callbacks called under the
hood" vs "just calling a series of functions".  I think the
patched XLogReadRecord is easy to use in many situations.
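
For what it's worth, the shape of the pull-style loop quoted above can
be shown with a self-contained toy; every name below is invented for
illustration and none of it is the actual WAL-reader API, but the
caller-drives-the-reader pattern is the same:

#include <stdio.h>
#include <string.h>

/* Toy "record" reader: a record is complete once a '\n' has arrived. */
typedef enum { READ_NEED_DATA, READ_RECORD_READY } read_status;

typedef struct
{
    char   buf[256];
    size_t used;
} toy_reader;

static void
toy_feed(toy_reader *r, const char *chunk, size_t len)
{
    memcpy(r->buf + r->used, chunk, len);
    r->used += len;
    r->buf[r->used] = '\0';
}

static read_status
toy_read_record(toy_reader *r, char **record)
{
    if (strchr(r->buf, '\n') == NULL)
        return READ_NEED_DATA;      /* ask the caller for more input */
    *record = r->buf;
    return READ_RECORD_READY;
}

int
main(void)
{
    const char *chunks[] = {"checkpoint ", "record", "\n"};
    toy_reader  r = {{0}, 0};
    char       *record = NULL;
    int         i = 0;

    /* Mirrors: while (XLogReadRecord(...) == XLREAD_NEED_DATA) XLogPageRead(...); */
    while (toy_read_record(&r, &record) == READ_NEED_DATA)
    {
        toy_feed(&r, chunks[i], strlen(chunks[i]));  /* the "XLogPageRead" step */
        i++;
    }
    printf("got: %s", record);
    return 0;
}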

It would be better if I could completely refactor the function
without the syntax tricks but I think the current patch is still
smaller and clearer than overhauling it.

If many folks think that adding a callback is better than
this refactoring, I will withdraw this.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Memory-Bounded Hash Aggregation

2019-07-12 Thread Tomas Vondra

On Thu, Jul 11, 2019 at 06:06:33PM -0700, Jeff Davis wrote:

On Thu, 2019-07-11 at 17:55 +0200, Tomas Vondra wrote:

Makes sense. I haven't thought about how the hybrid approach would be
implemented very much, so I can't quite judge how complicated it would
be to extend "approach 1" later. But if you think it's a sensible first
step, I trust you. And I certainly agree we need something to compare
the other approaches against.


Is this a duplicate of your previous email?



Yes. I don't know how I managed to send it again. Sorry.


I'm slightly confused but I will use the opportunity to put out another
WIP patch. The patch could use a few rounds of cleanup and quality
work, but the functionality is there and the performance seems
reasonable.

I rebased on master and fixed a few bugs, and most importantly, added
tests.

It seems to be working with grouping sets fine. It will take a little
longer to get good performance numbers, but even for a group size of one,
I'm seeing HashAgg get close to Sort+Group in some cases.



Nice! That's very nice progress!


You are right that the missed lookups appear to be costly, at least
when the data all fits in system memory. I think it's the cache misses,
because sometimes reducing work_mem improves performance. I'll try
tuning the number of buckets for the hash table and see if that helps.
If not, then the performance still seems pretty good to me.

Of course, HashAgg can beat sort for larger group sizes, but I'll try
to gather some more data on the cross-over point.



Yes, makes sense. I think it's acceptable as long as we consider this
during costing (when we know in advance we'll need this) or treat it as an
emergency measure.
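
As an aside, the overall shape of the strategy under discussion
(aggregate the groups that fit, spill the rest, aggregate the spilled
tuples in a later pass) can be sketched in a toy form. This is only an
illustration of the idea -- MAX_GROUPS stands in for the work_mem limit
and a linear scan stands in for the hash lookups -- and none of it is
the patch's actual code:

#include <stdio.h>
#include <stdlib.h>

#define MAX_GROUPS 2            /* pretend this is the work_mem limit */

typedef struct { int key; int count; } group;

/* One pass: aggregate what fits, push everything else into spill[]. */
static void
aggregate_pass(const int *keys, int nkeys, int *spill, int *nspill)
{
    group table[MAX_GROUPS];
    int   ngroups = 0;

    *nspill = 0;
    for (int i = 0; i < nkeys; i++)
    {
        int j;

        for (j = 0; j < ngroups; j++)
            if (table[j].key == keys[i])
                break;

        if (j < ngroups)
            table[j].count++;             /* group already in memory: advance it */
        else if (ngroups < MAX_GROUPS)
        {
            table[ngroups].key = keys[i]; /* room left: start a new group */
            table[ngroups].count = 1;
            ngroups++;
        }
        else
            spill[(*nspill)++] = keys[i]; /* no room: spill for a later pass */
    }

    for (int j = 0; j < ngroups; j++)
        printf("key %d -> count %d\n", table[j].key, table[j].count);
}

int
main(void)
{
    int  keys[] = {1, 2, 1, 3, 2, 3, 3, 4};
    int  nkeys = (int) (sizeof(keys) / sizeof(keys[0]));
    int *spill = malloc(sizeof(keys));
    int  nspill;

    aggregate_pass(keys, nkeys, spill, &nspill);
    while (nspill > 0)                    /* later passes consume spilled tuples */
    {
        int *next = malloc((size_t) nspill * sizeof(int));
        int  nnext;

        aggregate_pass(spill, nspill, next, &nnext);
        free(spill);
        spill = next;
        nspill = nnext;
    }
    free(spill);
    return 0;
}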


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services