Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-12 Thread Amit Kapila
On Fri, Jun 12, 2020 at 6:11 PM Magnus Hagander  wrote:
>
> On Fri, Jun 12, 2020 at 10:23 AM Amit Kapila  wrote:
>>
>
>
> The problem with "lifetime of a process" is that it's not predictable. A 
> replication process might "bounce" for any reason, and it is normally not a 
> problem. But if you suddenly lose your stats when you do that, it starts to 
> matter a lot more. Especially when you don't know if it bounced. (Sure you 
> can look at the backend_start time, but that adds a whole different set of 
> complexities.)
>

It is not clear to me what a good way would be to display the stats for a
process that has exited or bounced for whatever reason.  OTOH, if we
display only per-slot stats, it is difficult to imagine how the user
could make any sense of them, or in other words how such stats could be
useful to users.

>
>> > The view
>> > will have the following columns for example?
>> >
>> > * pid
>> > * slot_name
>> > * spill_txns
>> > * spill_count
>> > * spill_bytes
>> > * exec_count
>> >
>>
>> Yeah, these appear to be what I have in mind.  Note that we can have
>> multiple entries for the same pid here because of slot_name.  There is
>> some value in displaying slot_name, but I am not completely sure it is
>> a good idea; still, I am fine with it if you have a reason to include it.
>
>
> Well, it's a general view, so you can always GROUP BY that away at read 
> time if you want?
>

Okay, that is a valid point.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-12 Thread Amit Kapila
On Fri, Jun 12, 2020 at 6:24 PM Masahiko Sawada
 wrote:
>
> On Fri, 12 Jun 2020 at 19:24, Amit Kapila  wrote:
> >
> > > > Which are these other online transactions?  I had assumed that the foreign
> > > > transaction resolver process is there to resolve in-doubt transactions, but
> > > > it seems it is also used for some other purpose, which anyway was the
> > > > next question I had while reviewing other sections of the docs; but let's
> > > > clarify it as it came up now.
> > >
> > > When a distributed transaction is committed by the COMMIT command, the
> > > postgres backend process prepares all foreign transactions and commits
> > > the local transaction.
> > >
>
> Thank you for your review comments! Let me answer your question first;
> I'll go through the review comments afterwards.
>
> >
> > Does this mean that we will mark the xid as committed in CLOG of the
> > local server?
>
> Well, what I meant is that when the client executes the COMMIT command, the
> backend executes PREPARE TRANSACTION on all involved foreign
> servers and then marks the xid as committed in the clog on the local
> server.
>

Won't it create an inconsistency when viewing the data from the
different servers?  Say such a transaction inserts one row into the
local server and another into the foreign server.  Now, if we follow
the above protocol, the user will be able to see the row from the
local server but not the one from the foreign server.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: doc examples for pghandler

2020-06-12 Thread Michael Paquier
On Fri, Jun 12, 2020 at 10:13:41PM -0400, Tom Lane wrote:
> On second thought, contrib/ is not quite the right place, because we
> typically expect modules there to actually get installed, meaning they
> have to have at least some end-user usefulness.  The right place for
> a toy PL handler is probably src/test/modules/; compare for example
> src/test/modules/test_parser/, which is serving quite the same sort
> of purpose as a skeleton text search parser.

+1 for src/test/modules/, and if you can provide some low-level API
coverage through this module, that's even better.
--
Michael




Re: Parallel copy

2020-06-12 Thread Amit Kapila
On Fri, Jun 12, 2020 at 4:57 PM Ashutosh Sharma  wrote:
>
> Hi All,
>
> I've spent a little bit of time going through the project discussion that has 
> happened in this email thread, and to start with I have a few questions which I 
> would like to put here:
>
> Q1) Are we also planning to read the input data in parallel or is it only 
> about performing the multi-insert operation in parallel? AFAIU, the data 
> reading part will be done by the leader process alone so no parallelism is 
> involved there.
>

Yes, your understanding is correct.

> Q2) How are we going to deal with the partitioned tables?
>

I haven't studied the patch but my understanding is that we will
support parallel copy for partitioned tables with a few restrictions
as explained in my earlier email [1].  See, Case-2 (b) in the email.

> I mean, will there be some worker process dedicated to each partition, or how 
> will it work?

No, the split is just based on the input; otherwise, each worker
inserts just as we would have done without any workers.

> Q3) In case of toast tables, there is a possibility of having a single tuple 
> in the input file which could be of a very big size (probably in GB), 
> eventually resulting in a bigger file size. So, in this case, how are we 
> going to decide the number of worker processes to be launched?  I mean, 
> although the file size is big, the number of tuples to be processed is 
> just one or a few, so can we decide the number of worker processes 
> to be launched based on the file size?
>

Yeah, such situations would be tricky, so we should have an option for
the user to specify the number of workers.

> Q4) Who is going to process constraints (particularly deferred constraints) 
> that are supposed to be executed at COMMIT time?  I mean, is it the leader 
> process or the worker process, or won't we be choosing 
> parallelism at all in such cases?
>

In the first version, we won't do parallelism for this.  Again, see
one of my earlier emails [1] where I have explained this and other
cases where we won't be supporting parallel copy.

> Q5) Do we have any risk of table bloat when the data is loaded in 
> parallel?  I am just asking this because in case of parallelism there would be 
> multiple processes performing bulk inserts into a table. There is a chance 
> that the table file might get extended even if there is some free space in the 
> file being written to, because that space is locked by some other worker 
> process, and hence that might result in the creation of a new block for that 
> table. Sorry if I am missing something here.
>

Hmm, each worker will operate at the page level; after the first insertion,
the same worker will try to insert into the same page in which it
inserted last, so there shouldn't be such a problem.

> Please note that I haven't gone through all the emails in this thread so 
> there is a possibility that I might have repeated the question that has 
> already been raised and answered here. If that is the case, I am sorry for 
> that, but it would be very helpful if someone could point out that thread so 
> that I can go through it. Thank you.
>

No problem, I understand it is sometimes difficult to go through each
and every email, especially when the discussion is long.  Anyway,
thanks for showing interest in the patch.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BANNEaMJCCXm4naweP5PLY6LhJMvGo_V7-Pnfbh6GsOA%40mail.gmail.com


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2020-06-12 Thread Pavel Stehule
so 13. 6. 2020 v 1:28 odesílatel Andres Freund  napsal:

> Hi,
>
> Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes
> the resulting timing enough that the times aren't meaningful. E.g.
>
> CREATE TABLE lotsarows(key int not null);
> INSERT INTO lotsarows SELECT generate_series(1, 5000);
> VACUUM FREEZE lotsarows;
>
>
> -- best of three:
> SELECT count(*) FROM lotsarows;
> Time: 1923.394 ms (00:01.923)
>
> -- best of three:
> EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
> Time: 2319.830 ms (00:02.320)
>
> -- best of three:
> EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
> Time: 4202.649 ms (00:04.203)
>
> That's nearly *double* the execution time of the run without TIMING.
>
>
> Looking at a profile of this shows that we spend a good bit of cycles
> "normalizing" timestamps etc. That seems pretty unnecessary, just forced
> on us due to struct timespec. So the first attached patch just turns
> instr_time into a 64-bit integer, counting nanoseconds.
>
> That helps, a tiny bit:
> EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
> Time: 4179.302 ms (00:04.179)
>
> but obviously doesn't move the needle.
>
>
> Looking at a profile it's easy to confirm that we spend a lot of time
> acquiring time:
> -   95.49% 0.00%  postgres postgres [.]
> agg_retrieve_direct (inlined)
>- agg_retrieve_direct (inlined)
>   - 79.27% fetch_input_tuple
>  - ExecProcNode (inlined)
> - 75.72% ExecProcNodeInstr
>+ 25.22% SeqNext
>- 21.74% InstrStopNode
>   + 17.80% __GI___clock_gettime (inlined)
>- 21.44% InstrStartNode
>   + 19.23% __GI___clock_gettime (inlined)
>+ 4.06% ExecScan
>   + 13.09% advance_aggregates (inlined)
> 1.06% MemoryContextReset
>
> And that's even though linux avoids a syscall (in most cases) etc to
> acquire the time. Unless the kernel detects there's a reason not to do
> so, linux does this by using 'rdtscp' and multiplying it by kernel
> provided factors to turn the cycles into time.
>
> Some of the time is spent doing function calls, dividing into struct
> timespec, etc. But most of it is just the rdtscp instruction:
>  65.30 │1  63:   rdtscp
>
>
> The reason for that is largely that rdtscp waits until all prior
> instructions have finished (but it allows later instructions to already
> start). Multiple times for each tuple.
>
>
> In the second attached prototype patch I've changed instr_time to count
> in CPU cycles instead of nanoseconds, and then just turned the cycles
> into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later).
>
> When using rdtsc that results in *vastly* lower overhead:
>
> ┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> │                                                        QUERY PLAN                                                       │
> ├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ Aggregate  (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1)                   │
> │   ->  Seq Scan on lotsarows  (cost=0.00..721239.16 rows=5016 width=0) (actual time=0.006..1512.886 rows=5000 loops=1)   │
> │ Planning Time: 0.028 ms                                                                                                  │
> │ Execution Time: 2610.256 ms                                                                                              │
> └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> (4 rows)
>
> Time: 2610.589 ms (00:02.611)
>
> And there's still some smaller improvements that could be made ontop of
> that.
>
> As a comparison, here's the time when using rdtscp directly in
> instr_time, instead of going through clock_gettime:
> Time: 3481.162 ms (00:03.481)
>
> That shows pretty well how big the cost of the added pipeline stalls
> are, and how important out-of-order execution is for decent
> performance...
>
>
> In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we
> do *not* want to wait for prior instructions to finish, since that
> actually leads to the timing being less accurate, rather than
> more. There are other cases where that'd be different, e.g. measuring
> how long an entire query takes or such (but there it's probably
> irrelevant which to use).
>
>
> I've above skipped a bit over the details of how to turn the cycles
> returned by rdtsc into time:
>
> On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that
> have actually been run, but instead returns the number of 'reference
> cycles'. That's important because otherwise things like turbo mode and
> lower power modes would lead to completely bogus times.
>
> Thus, knowing the "base frequency" of the CPU allows to turn the
> 

Re: doc examples for pghandler

2020-06-12 Thread Tom Lane
Mark Wong  writes:
> On Fri, Jun 12, 2020 at 03:10:20PM -0400, Tom Lane wrote:
>> I wonder if it'd be possible to adapt what you have here into some
>> tiny contrib module that doesn't do very much useful, but can at
>> least be tested to see that it compiles; moreover it could be
>> copied verbatim to serve as a starting point for a new PL.

> I do have the code examples in a repo. [1]  The 0.4 directory consists
> of everything the examples show.  

> It would be easy enough to adapt that for contrib, and move some of the
> content from the doc patch into that.  Then touch up the handler chapter
> to reference the contrib module.

On second thought, contrib/ is not quite the right place, because we
typically expect modules there to actually get installed, meaning they
have to have at least some end-user usefulness.  The right place for
a toy PL handler is probably src/test/modules/; compare for example
src/test/modules/test_parser/, which is serving quite the same sort
of purpose as a skeleton text search parser.

regards, tom lane




Re: what can go in root.crt ?

2020-06-12 Thread Chapman Flack
On 06/12/20 16:17, Chapman Flack wrote:
> reason. Typically that reason is that it has been placed in a file that
> can only be edited by the admin who decides what certs to trust. By
> editing it into that file, that responsible person has vouched for it,
> and /that/ is why the TLS client should trust it.

In order to wave my hands less, and map more easily onto the RFCs,
I ought to start saying these things:

Relying Party                  when I mean libpq in its role of validating a server

Trust Anchor Store             when I mean libpq's root.crt file

Trust Anchor Manager           when I mean me, putting a thing into root.crt

Trust Anchor                   when I mean a thing I put into root.crt

Target Certificate             the one the Relying Party wants to validate; in this
                               case, the end-entity cert assigned to the pgsql server

Certification Path Validation  the algorithm in RFC 5280 sec. 6

Certification Path Building    the task described in RFC 4158



RFC 5280 expresses the Path Validation algorithm as starting from
a Trust Anchor and proceeding toward the Target Certificate. In this
thread so far I've been waving my hands in the other direction, but
that properly falls under Path Building.

If your Trust Anchor Store contains only one Trust Anchor, then the
Path Validation algorithm is all you need. If there may be multiple
Trust Anchors there, Path Building is the process of enumerating
possible paths with a Trust Anchor at one end and the Target Certificate
at the other, in the hope that Path Validation will succeed for at least
one of them.

RFC 4158 isn't prescriptive: it doesn't give one way to build paths, but
a smörgåsbord of approaches. Amusingly, what it calls "forward path
building" is when you start from the Target Certificate and search toward
a Trust Anchor (same way I've been waving my hands, but reversed w.r.t.
Path Validation), and what it calls "reverse path building" is when you
start with your Trust Anchors and search toward the Target Certificate
(the same direction as Path Validation).

RFC 4158 has an extensive gallery of ASCII art showing what the PKI
can end up looking like in some large enterprises.  :O



Initial inputs to Path Validation include a distinguished name and
a public key, optionally with some constraints, and those things come
from a Trust Anchor. There is no requirement that a Trust Anchor be
a cert, signed, self-signed, or otherwise. The certificate format has
often been used as a Trust Anchor container format because, hey, it
holds a distinguished name and a public key and constraints, and if
you're writing a path validator, you already have a parser for it.

Other things that happen to be present in a certificate-as-Trust-Anchor,
such as an issuer name, key, and signature, are non-inputs to the path
validation algorithm and have no effect on it.

Disappointingly, common implementations have tended also to ignore
constraints held in a certificate-as-Trust-Anchor, even though they
correctly apply constraints in other certs encountered along the path,
and initial constraints are supposed to be inputs to the algorithm.
It is the point of RFC 5937 to fix that.

'Constraints' in this context are limits such as "this cert is for
identifying servers" or "this is a CA cert but only allowed to sign
certs for *.example.com". The RFCs plainly anticipate that I might
want to put new constraints on a Trust Anchor, say to use the cert
of a CA that has other customers, without implying a trust relationship
with their other customers.

For that purpose, the historical 'convenience' of using certificates
as Trust Anchor containers is a genuine hindrance, because of course
certificates are cryptographically signed objects so editing their
constraints can't be easily done.[1]

The Trust Anchor Format (cited in [1] as "in progress" but since published
as RFC 5914) therefore proposes a couple alternatives to the use of a
certificate as an ersatz Trust Anchor container.

RFC 6024, Trust Anchor Management Requirements, sets out the considerations
RFC 5914 and RFC 5934 were to address. (In classic see-I-did-it-perfectly
fashion, it was published after they were.)


So, if libpq had a Trust Anchor Store that worked as described in these
RFCs, my use case would be trivial to set up.

I guess the next question is to what extent recent OpenSSL groks those,
or how far back was the first version that did (these RFCs are from 2010),
and what would be entailed in taking advantage of that support if it's
present.

Regards,
-Chap


[1] https://dl.acm.org/doi/10.1145/1750389.1750403




Re: Add tap test for --extra-float-digits option

2020-06-12 Thread Michael Paquier
On Fri, Jun 12, 2020 at 06:15:36PM +0900, Dong Wook Lee wrote:
> Oh, now I understand. and I added a test of --row-per-insert option.

That's more of a habit: look around, find similar patterns, and then
check if these are covered.

I have applied your patch, and you may want to be careful about a
couple of things:
- Please avoid top-posting on the mailing lists:
https://en.wikipedia.org/wiki/Posting_style#Top-posting
Top-posting breaks the logic of a thread.
- Your patch format is good.  When sending a new version of the patch,
it may be better to send things as a complete diff on the master
branch (or the branch you are working on), instead of just sending one
patch that applies on top of something you sent previously.  Here for
example your patch 0002 applied on top of 0001 that was sent at the
top of the thread.  We have also guidelines about patch submission:
https://wiki.postgresql.org/wiki/Submitting_a_Patch

Thanks!
--
Michael




Re: Speedup usages of pg_*toa() functions

2020-06-12 Thread David Rowley
On Thu, 11 Jun 2020 at 18:52, Andrew Gierth  wrote:
>
> > "David" == David Rowley  writes:
>
>  David> Pending any objections, I'd like to push both of these patches
>  David> in the next few days to master.
>
> For the second patch, can we take the opportunity to remove the
> extraneous blank line at the top of pg_ltoa, and add the two missing
> "extern"s in builtins.h for pg_ultoa_n and pg_ulltoa_n ?
>
>  David> Anyone object to changing the signature of these functions in
>  David> 0002, or have concerns about allocating the maximum memory that
>  David> we might require in int8out()?
>
> Changing the function signatures seems safe enough. The memory thing
> only seems likely to be an issue if you allocate a lot of text strings
> for bigint values without a context reset, and I'm not sure where that
> would happen (maybe passing large bigint arrays to pl/perl or pl/python
> would do it?)

I ended up chickening out of doing the larger allocation
unconditionally. Instead, I pushed the original idea of doing the
palloc/memcpy of the length returned by pg_lltoa.  That gets us most
of the gains without the change in memory usage behaviour.
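
For illustration, a rough sketch of that approach (assuming pg_lltoa()
reports the number of bytes it wrote, which is the premise of the patch;
this is not necessarily the committed code):

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

Datum
int8out(PG_FUNCTION_ARGS)
{
	int64		val = PG_GETARG_INT64(0);
	char		buf[MAXINT8LEN + 1];
	char	   *result;
	int			len;

	/* format into a stack buffer; pg_lltoa() is assumed to return the length */
	len = pg_lltoa(val, buf);

	/* copy only what was actually written, plus the trailing NUL */
	result = palloc(len + 1);
	memcpy(result, buf, len + 1);
	PG_RETURN_CSTRING(result);
}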

Thanks for your reviews on this.

David




Re: hashagg slowdown due to spill changes

2020-06-12 Thread Andres Freund
Hi,

On 2020-06-13 01:06:25 +0200, Tomas Vondra wrote:
> I agree, we should revert 4cad2534da and only project tuples when we
> actually need to spill them.

There are cases where projecting helps for non-spilling aggregates too,
but only for the representative tuple. It doesn't help in the case at
hand, because there are just 5 hashtable entries but millions of rows, so
we're unnecessarily projecting all of those rows. But when there are many
different groups, it'd be different, because then the size of the
representative tuple can matter substantially.

Do you think we should tackle this for 13? To me 4cad2534da seems like a
somewhat independent improvement to spillable hashaggs.

Greetings,

Andres Freund




Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2020-06-12 Thread Andres Freund
Hi,

Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes
the resulting timing enough that the times aren't meaningful. E.g.

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 5000);
VACUUM FREEZE lotsarows;


-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)

-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)

-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)

That's nearly *double* the execution time of the run without TIMING.


Looking at a profile of this shows that we spend a good bit of cycles
"normalizing" timestamps etc. That seems pretty unnecessary, just forced
on us due to struct timespec. So the first attached patch just turns
instr_time into a 64-bit integer, counting nanoseconds.
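
As a rough illustration only (not the attached patch; the macro set is
trimmed down and the names here are assumptions), a nanosecond-based
instr_time could look like:

/* Minimal sketch of a nanosecond-based instr_time; illustrative only. */
#include <stdint.h>
#include <time.h>

typedef int64_t instr_time;		/* nanoseconds since an arbitrary epoch */

#define INSTR_TIME_SET_CURRENT(t) \
	do { \
		struct timespec ts_; \
		clock_gettime(CLOCK_MONOTONIC, &ts_); \
		(t) = (int64_t) ts_.tv_sec * 1000000000 + ts_.tv_nsec; \
	} while (0)

#define INSTR_TIME_SUBTRACT(x, y)		((x) -= (y))
#define INSTR_TIME_ACCUM_DIFF(x, y, z)	((x) += (y) - (z))
#define INSTR_TIME_GET_DOUBLE(t)		((double) (t) / 1000000000.0)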

That helps, a tiny bit:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)

but obviously doesn't move the needle.


Looking at a profile it's easy to confirm that we spend a lot of time
acquiring time:
-   95.49% 0.00%  postgres postgres [.] 
agg_retrieve_direct (inlined)
   - agg_retrieve_direct (inlined)
  - 79.27% fetch_input_tuple
 - ExecProcNode (inlined)
- 75.72% ExecProcNodeInstr
   + 25.22% SeqNext
   - 21.74% InstrStopNode
  + 17.80% __GI___clock_gettime (inlined)
   - 21.44% InstrStartNode
  + 19.23% __GI___clock_gettime (inlined)
   + 4.06% ExecScan
  + 13.09% advance_aggregates (inlined)
1.06% MemoryContextReset

And that's even though linux avoids a syscall (in most cases) etc to
acquire the time. Unless the kernel detects there's a reason not to do
so, linux does this by using 'rdtscp' and multiplying it by kernel
provided factors to turn the cycles into time.

Some of the time is spent doing function calls, dividing into struct
timespec, etc. But most of it is just the rdtscp instruction:
 65.30 │1  63:   rdtscp


The reason for that is largely that rdtscp waits until all prior
instructions have finished (but it allows later instructions to already
start). Multiple times for each tuple.


In the second attached prototype patch I've changed instr_time to count
in CPU cycles instead of nanoseconds, and then just turned the cycles
into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later).
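
A minimal sketch of that direction (assuming x86-64 and the GCC/Clang
intrinsic; this is not the prototype patch, and filling in cycles_per_sec
is discussed further down):

/* Illustrative sketch only: instr_time as raw TSC reference cycles. */
#include <stdint.h>
#include <x86intrin.h>

typedef int64_t instr_time;		/* TSC reference cycles */

/* must be determined once at startup from the CPU's base frequency */
static double cycles_per_sec;

#define INSTR_TIME_SET_CURRENT(t)	((t) = (int64_t) __rdtsc())
#define INSTR_TIME_SUBTRACT(x, y)	((x) -= (y))
#define INSTR_TIME_GET_DOUBLE(t)	((double) (t) / cycles_per_sec)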

When using rdtsc that results in *vastly* lower overhead:
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                        QUERY PLAN                                                       │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1)                   │
│   ->  Seq Scan on lotsarows  (cost=0.00..721239.16 rows=5016 width=0) (actual time=0.006..1512.886 rows=5000 loops=1)   │
│ Planning Time: 0.028 ms                                                                                                  │
│ Execution Time: 2610.256 ms                                                                                              │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)

Time: 2610.589 ms (00:02.611)

And there's still some smaller improvements that could be made ontop of
that.

As a comparison, here's the time when using rdtscp directly in
instr_time, instead of going through clock_gettime:
Time: 3481.162 ms (00:03.481)

That shows pretty well how big the cost of the added pipeline stalls
are, and how important out-of-order execution is for decent
performance...


In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we
do *not* want to wait for prior instructions to finish, since that
actually leads to the timing being less accurate, rather than
more. There are other cases where that'd be different, e.g. measuring
how long an entire query takes or such (but there it's probably
irrelevant which to use).


I've above skipped a bit over the details of how to turn the cycles
returned by rdtsc into time:

On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that
have actually been run, but instead returns the number of 'reference
cycles'. That's important because otherwise things like turbo mode and
lower power modes would lead to completely bogus times.

Thus, knowing the "base frequency" of the CPU allows to turn the
difference between two rdtsc return values into seconds.
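
As an illustration (a sketch only, not the prototype itself), reading the
base frequency with the CPUID instruction could look like this, assuming a
CPU recent enough to expose leaf 0x16:

/* Illustrative sketch: query the processor base frequency via CPUID 0x16. */
#include <stdint.h>
#include <cpuid.h>

/* Returns the base frequency in MHz, or 0 if leaf 0x16 is unsupported. */
static uint32_t
get_base_frequency_mhz(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid_max(0, NULL) < 0x16)
		return 0;
	__cpuid(0x16, eax, ebx, ecx, edx);
	return eax;			/* EAX reports the base frequency in MHz */
}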

In the attached prototype I just determined the number of cycles using
cpuid(0x16). That's only available since 

Re: hashagg slowdown due to spill changes

2020-06-12 Thread Tomas Vondra

On Fri, Jun 12, 2020 at 03:29:08PM -0700, Jeff Davis wrote:

> On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:
> > I don't see why it's ok to force an additional projection in the very
> > common case of hashaggs over a few rows. So I think we need to rethink
> > 4cad2534da6.
>
> One possibility is to project only spilled tuples, more similar to
> Melanie's patch from a while ago:
>
> https://www.postgresql.org/message-id/caakru_aefesv+ukqwqu+ioenoil2lju9diuh9br8mbyxuz0...@mail.gmail.com
>
> Which makes sense, but it's also more code.

I agree, we should revert 4cad2534da and only project tuples when we
actually need to spill them. Did any of the WIP patches actually
implement that, or do we need to write that patch from scratch?


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: Speedup usages of pg_*toa() functions

2020-06-12 Thread David Rowley
On Thu, 11 Jun 2020 at 18:52, Andrew Gierth  wrote:
> For the second patch, can we take the opportunity to remove the
> extraneous blank line at the top of pg_ltoa, and add the two missing
> "extern"s in builtins.h for pg_ultoa_n and pg_ulltoa_n ?

I think, since we've now branched for PG14, that fixing those should be
backpatched to PG13.

David




Re: doc examples for pghandler

2020-06-12 Thread Mark Wong
On Fri, Jun 12, 2020 at 03:10:20PM -0400, Tom Lane wrote:
> Mark Wong  writes:
> > Would some additional procedure language handler code examples in the
> > documentation be good to add?  I've put some together in the attached
> > patch, and can log it to a future commitfest if people think it would
> > be a good addition.
> 
> Hmm.  The existing doc examples are really pretty laughable, because
> there's such a large gap between the offered skeleton and a workable
> handler.  So I agree it'd be nice to do better, but I'm suspicious of
> having large chunks of sample code in the docs --- that's a maintenance
> problem, if only because we likely won't notice when we break it.
> Also, if somebody is hoping to copy-and-paste such code, it isn't
> that nice to work from if it's embedded in SGML.
> 
> I wonder if it'd be possible to adapt what you have here into some
> tiny contrib module that doesn't do very much useful, but can at
> least be tested to see that it compiles; moreover it could be
> copied verbatim to serve as a starting point for a new PL.

I do have the code examples in a repo. [1]  The 0.4 directory consists
of everything the examples show.  

It would be easy enough to adapt that for contrib, and move some of the
content from the doc patch into that.  Then touch up the handler chapter
to reference the contrib module.

Does that sound more useful?

[1] https://gitlab.com/markwkm/yappl

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: hashagg slowdown due to spill changes

2020-06-12 Thread Jeff Davis
On Fri, 2020-06-12 at 14:37 -0700, Andres Freund wrote:
> I don't see why it's ok to force an additional projection in the very
> common case of hashaggs over a few rows. So I think we need to
> rethink
> 4cad2534da6.

One possibility is to project only spilled tuples, more similar to
Melanie's patch from a while ago:


https://www.postgresql.org/message-id/caakru_aefesv+ukqwqu+ioenoil2lju9diuh9br8mbyxuz0...@mail.gmail.com

Which makes sense, but it's also more code.

Regards,
Jeff Davis






Re: Postgresql13_beta1 (could not rename temporary statistics file) Windows 64bits

2020-06-12 Thread Ranier Vilela
Em sex., 12 de jun. de 2020 às 15:15, Ranier Vilela 
escreveu:

> Postgres13_beta1 is consistently writing "could not rename temporary
> statistics file" to the logs.
> When analyzing the source that writes the log, I simplified the part that
> writes the logs a little:
>
> 1. I changed the "if ... else if" chain to independent "if"s
> 2. For the user, it is better to have more errors recorded, which can help in
> discovering the problem
> 3. Errors are independent of each other
> 4. If I can't release tmpfile, there's no way to delete it (unlink)
> 5. If I can rename tmpfile, there is no need to delete (unlink) it.
>
Fix for case 5.


001_simplifies_write_statsfiles_v2.patch
Description: Binary data


Re: hashagg slowdown due to spill changes

2020-06-12 Thread Andres Freund
Hi,

On 2020-06-11 11:14:02 -0700, Jeff Davis wrote:
> On Thu, 2020-06-11 at 10:45 -0700, Andres Freund wrote:
> > Did you run any performance tests?
>
> Yes, I reproduced your ~12% regression from V12, and this patch nearly
> eliminated it for me.

I spent a fair bit of time looking at the difference. Jeff had let me
know on chat that he was still seeing some difference, but couldn't
quite figure out where that was.

Trying it out myself, I observed that the patch helped, but not that
much. After a bit I found one major reason why:
LookupTupleHashEntryHash() assigned the hash to the pointer provided by the
caller before doing the insertion. That ended up causing a pipeline
stall (I assume it's store forwarding, but I'm not sure). Moving the
assignment to the caller's variable to after the insertion got rid of
that.

It got within 3-4% after that change. I did a number of small
micro-optimizations that each helped, but didn't quite get to the
level of v12.

Finally I figured out that that's due to an issue outside of nodeAgg.c
itself:

commit 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0
Author: Tomas Vondra 
Date:   2020-05-31 14:43:13 +0200

Use CP_SMALL_TLIST for hash aggregate

Due to this change we end up with an additional projection in queries
like this:

postgres[212666][1]=# \d fewgroups_many_rows
 Table "public.fewgroups_many_rows"
┌────────┬─────────┬───────────┬──────────┬─────────┐
│ Column │  Type   │ Collation │ Nullable │ Default │
├────────┼─────────┼───────────┼──────────┼─────────┤
│ cat    │ integer │           │ not null │         │
│ val    │ integer │           │ not null │         │
└────────┴─────────┴───────────┴──────────┴─────────┘

postgres[212666][1]=# explain SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                      QUERY PLAN                                     │
├─────────────────────────────────────────────────────────────────────────────────────┤
│ HashAggregate  (cost=1942478.48..1942478.53 rows=5 width=12)                        │
│   Group Key: cat                                                                    │
│   ->  Seq Scan on fewgroups_many_rows  (cost=0.00..1442478.32 rows=10032 width=4)   │
└─────────────────────────────────────────────────────────────────────────────────────┘
(3 rows)

as 'val' is "projected away".


After neutering the tlist change, Jeff's patch and my changes to it
yield performance *above* v12.


I don't see why it's ok to force an additional projection in the very
common case of hashaggs over a few rows. So I think we need to rethink
4cad2534da6.

Greetings,

Andres Freund
diff --git i/src/include/executor/executor.h w/src/include/executor/executor.h
index c7deeac662f..415e117407c 100644
--- i/src/include/executor/executor.h
+++ w/src/include/executor/executor.h
@@ -139,7 +139,7 @@ extern TupleHashTable BuildTupleHashTableExt(PlanState *parent,
 			 MemoryContext tempcxt, bool use_variable_hash_iv);
 extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
 		   TupleTableSlot *slot,
-		   bool *isnew);
+		   bool *isnew, uint32 *hash);
 extern uint32 TupleHashTableHash(TupleHashTable hashtable,
  TupleTableSlot *slot);
 extern TupleHashEntry LookupTupleHashEntryHash(TupleHashTable hashtable,
diff --git i/src/backend/executor/execGrouping.c w/src/backend/executor/execGrouping.c
index 8be36ca7634..1e582832ea0 100644
--- i/src/backend/executor/execGrouping.c
+++ w/src/backend/executor/execGrouping.c
@@ -22,11 +22,12 @@
 #include "utils/memutils.h"
 
 static int	TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tuple1, const MinimalTuple tuple2);
-static uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
+static inline uint32 TupleHashTableHash_internal(struct tuplehash_hash *tb,
 		  const MinimalTuple tuple);
-static TupleHashEntry LookupTupleHashEntry_internal(TupleHashTable hashtable,
-	TupleTableSlot *slot,
-	bool *isnew, uint32 hash);
+static inline TupleHashEntry LookupTupleHashEntry_internal(
+	TupleHashTable hashtable,
+	TupleTableSlot *slot,
+	bool *isnew, uint32 hash);
 
 /*
  * Define parameters for tuple hash table code generation. The interface is
@@ -291,6 +292,9 @@ ResetTupleHashTable(TupleHashTable hashtable)
  * If isnew is NULL, we do not create new entries; we return NULL if no
  * match is found.
  *
+ * If hash is not NULL, we set it to the calculated hash value. This allows
+ * callers access to the hash value even if no entry is returned.
+ *
  * If isnew isn't NULL, then a new entry is created if no existing entry
  * matches.  On return, *isnew is true if the entry is newly created,
  * false if it existed already.  ->additional_data in the new entry has
@@ -298,11 +302,11 @@ ResetTupleHashTable(TupleHashTable hashtable)
  */
 TupleHashEntry
 

Re: Internal key management system

2020-06-12 Thread Fabien COELHO



Hello Bruce.


>> The question is what should be put in the protocol, and I would tend to
>> think that some careful design time should be put in it.
>
> I still don't see the value of this vs. its complexity.

Dunno. I'm looking for the value of having such a thing in core.

ISTM that there are no clear design goals for the system, no clear 
description of the target use case(s), no clear explanation of the 
underlying choices (in something like a README), and no statement of what it 
achieves and what it does not achieve. It is only code.


If the proposed thing is very specific to one use case, which may be more 
or less particular, then I'd say the stuff should really be an external 
extension, and you do not need to ask for a review. Call it pgcryptoXYZ 
and it is done.


However, if the stuff is amenable to many/more use cases, then it may 
still be an extension because it is specialized somehow and not everyone 
would like to have it if they do not use it, but having it in core would 
be much more justified. Also, it would have to be a little more "complex" 
to be extensible, sure. I do not think that it needs to be very complex in 
the end, but it needs to be carefully designed to be extensible.


Note I still do not see why it should be in core directly, i.e. not an 
extension. I'm yet to see a convincing argument about that.


--
Fabien.




Re: Internal key management system

2020-06-12 Thread Fabien COELHO


Hello Masahiko-san,


Summarizing the discussed points so far, I think that the major
advantage points of your idea compared to the current patch's
architecture are:

* More secure. Because it never loads KEK in postgres processes we can
lower the likelihood of KEK leakage.


Yes.


* More extensible. We will be able to implement more protocols to
outsource other operations to the external place.


Yes.


On the other hand, here are some downsides and issues:

* The external place needs to manage more encryption keys than the
current patch does.


Why? If the external place is just a separate process on the same host, 
it would probably manage the very same amount as your patch does.


Some cloud key management services are charged by the number of active 
keys and key operations. So the number of keys postgres requires affects 
the charges. It'd be worse if we were to have keys per table.


Possibly. Note that you do not have to use cloud storage paid for as a 
service. However, you could do it if there is an interface, because the 
interface would allow postgres to do so if the user wishes. That is the point 
of having an interface that can be implemented differently for different 
use cases.



* If this approach supports only GET protocol, the user needs to
create encryption keys with appropriate ids in advance so that
postgres can get keys by id. If postgres TDE creates keys as needed,
CREATE protocol would also be required.


I'm not sure. ISTM that if there is a KMS to manage keys, it could be its 
responsibility to actually create a key; however, the client (pg) would 
have to request it, basically saying "give me a new key for this id".


This could even work with a "get" command only, if the KMS is expected to 
create a new key when asked for a key which does not exist yet. ISTM that 
the client could (should?) only have to create identifiers for its keys.



* If we need only the GET protocol, the current approach (i.e.
cluster_passphrase_command) would be simpler. I imagine the
interface between postgres and the extension is a C function.


Yes. ISTM that can be pretty simple, something like:

A guc to define the process to start the interface (having a process means 
that its uid can be changed), which would communicate on its stdin/stdout.


A guc to define how to interact with the interface (e.g. whether DEKs are 
retrieved, or whether the interface is to ask for encryption/decryption, 
or possibly some other modes).


A few functions:

 - set_key(<local-id>, <key-id>);
   # may retrieve the DEK, or only note the local id of some key.

 - encode(<local-id>, <data>) -> <encrypted-data>
   # may fail if no key is associated to local-id
   # or if the service is down somehow

 - decode(<local-id>, <encrypted-data>) -> <data>
   # could also fail if there is some integrity check associated


This approach is more extensible


Yep.

but it also means extensions need to support multiple protocols, leading 
to increased complexity and development cost.


I do not understand what you mean by "multiple protocols". For me there is 
one protocol, with possibly a few commands in this protocol between the client 
(postgres) and the KMS. Anyway, sending "GET <key-id>" to retrieve a DEK, for 
instance, does not sound "complex". Here is some pseudo-code:


For get_key:

  if (mode of operation is to have DEKs locally)
    try
      send to KMS "get <key-id>"
  keys[local-id] = answer
catch & rethrow possible errors
  elif (mode is to keep DEKs remote)
key_id[local-id] = key-id;
  else ...

For encode, the code is basically:

  if (has_key(local-id))
if (mode of operation is to have DEKs locally)
   return some_encode(key[local-id], data);
elif (mode is to keep DEKs remote)
   send to KMS "encode key_id[local-id] data"
   return answer; # or error
else ...
  else
 throw error local-id has no associated key;

decode is more or less the same as encode.
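
Put together at the C level, such an interface could be as small as the
following sketch (purely illustrative; none of these names exist in the
patch or in PostgreSQL):

/* Illustrative sketch of a possible C-level KMS hook; all names here are
 * assumptions, not an existing API. */
#include <stdbool.h>
#include <stddef.h>

typedef struct KmsRoutine
{
	/* bind a local key id to a key managed by the external KMS */
	bool		(*set_key) (const char *local_id, const char *kms_key_id);

	/* encrypt/decrypt data under the key previously bound to local_id */
	bool		(*encode) (const char *local_id,
						   const unsigned char *data, size_t len,
						   unsigned char **out, size_t *out_len);
	bool		(*decode) (const char *local_id,
						   const unsigned char *data, size_t len,
						   unsigned char **out, size_t *out_len);
} KmsRoutine;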

Another thing to consider is how the client "proves" its identity 
to the KMS interface, which might suggest some provisions when starting the 
process, but you already have things in your patch to deal with the KEK, 
which could be turned into some generic auth.



* This approach necessarily doesn't completely eliminate the data leakage
threat caused by process compromise.


Sure, if the process has decrypted data or a DEK or whatever, then 
compromising the process leaks that data. My point is to try to limit the 
leakage potential of a process compromise.



Since DEK is placed in postgres process memory,


May be placed, depending on the mode of operation.

it’s still possible that if a postgres process is compromised the 
attacker can steal database data.


Obviously. This cannot be helped if pg is to hold unencrypted data.

The benefit of lowering the possibility of KEK leakage is to deal with 
the threat of the attacker seeing database data encrypted by past or 
future DEKs protected by the stolen KEK.


Yes.


* An open question is, as you previously mentioned, how to verify the
key obtained from the external place is the 

Re: what can go in root.crt ?

2020-06-12 Thread Chapman Flack
On 06/12/20 15:13, Bruce Momjian wrote:
> On Wed, Jun  3, 2020 at 07:57:16PM -0400, Chapman Flack wrote:
>> here is a root.crt file for libpq containing only this exact cert, I
>> want libpq to connect only ever to this server with this cert and nothing
>> else. It's a pain because I have to roll out new root.crt files to everybody
>> whenever the cert changes, but it would be hard to call it unsafe.
> 
> I think you have hit on the reason CAs are used.  By putting a valid
> root certificate on the client, the server certificate can be changed
> without modifying the certificate on the client.
> 
> Without that ability, every client would need be changed as soon as the
> server certificate was changed.  Allowing intermediate certificates to
> function as root certificates would fix that problem.  When the
> non-trusted CA changes your certificate, you are going to have the same
> problem updating everything at once.

There seems to be a use of language here that works to make the picture
muddier rather than clearer.

I mean the use of "trusted"/"non-trusted" as if they somehow mapped onto
"self-signed"/"not self-signed" (unless you had some other mapping in mind
there).

That's downright ironic, as a certificate that is self-signed is one that
carries with it the absolute minimum grounds for trusting it: precisely
zero. There can't be any certificate you have less reason to trust than
a self-signed one.

(Ok, I take it back: a certificate you find on a revocation list /might/
be one you have less reason to trust.)

If a certificate, signed only by itself, ends up being relied on by
a TLS validator, that can only be because it is trusted for some other
reason. Typically that reason is that it has been placed in a file that
can only be edited by the admin who decides what certs to trust. By
editing it into that file, that responsible person has vouched for it,
and /that/ is why the TLS client should trust it. The fact that it is
self-signed, meaning only that nobody else ever vouched for it anywhere,
has nothing to do with why the TLS client should trust it.

Now, suppose that same responsible person edits that same file, but this
time places in it a cert that has been signed by some other authority.
That is a cert that has been vouched for in two ways: by the admin
placing it in this file, and by some other PKI authority.

As far as the TLS client is concerned, the endorsement that counts is
still the local one, that it has been placed in the local file by the
admin responsible for deciding what this client should trust. The fact
that somebody else vouched for it too is no reason for this client
to trust it, but is also no reason for this client not to trust it.
It is certainly in no way less to be trusted than a cert signed only
by itself.

The key point is that as soon as you find the cert you are looking at
in the local file curated by your admin, you know you've been cleared
to trust what you're looking at.

If the cert you're looking at is not in that file, and it has no signer
but itself, you must at that point fail. Dead end. There can be no reason
to trust it.

On the other hand, if you are looking a cert that has a signer, you have
not hit a dead end yet; you can climb that link and hope to find the signer
in your curated file, and so on.

You need to climb until you find something that's in that curated file.
Every step that you climbed needs to have had a valid signature made
while the signer was valid and not revoked, the signer needed to be allowed
to sign certs, and to sign certs for the subject's domain. Those things
needed to be checked at every step.

But once you have followed those steps and arrived at a cert that
was placed in your trust store by the admin, it's unnecessary and
limiting to insist arbitrarily on other properties of the cert you
found there.

> This is why a root certificate, which never changes, is helpful.

But who says it never changes?

As I mentioned earlier, my org has not always had its current procedures
on the issuance of certs, and not so many years ago did have its own
in-house CA. I ran across a copy of that CA cert recently. It was generated
in July of 2010 and is still good through next month. (I have not checked
to see whether the former in-house CA made a revocation entry somewhere
for it before turning the lights out.)

If we were still using that CA cert, I would still have to roll out
new root.crt files next month. I'm sure at the time it was generated,
ten years seemed like 'almost never', and like a reasonable time in which
to hope no adversary would crack a 2048 bit RSA key.

One certainly wouldn't plan on giving an important cert a lifetime
much longer than that.

So the benefit of putting a signing cert in root.crt is not so much
that it will never expire and need updating, but that you can keep
using it to sign other certs for new services you stand up or update,
and so you don't have to distribute new root.crt files every time you
do those 

Re: Infinities in type numeric

2020-06-12 Thread Tom Lane
Robert Haas  writes:
> On Fri, Jun 12, 2020 at 2:14 PM Tom Lane  wrote:
>> Robert Haas  writes:
>>> BTW, has there been any thought to supporting a negative scale for the
>>> numeric data type? If you can cut off digits after the decimal, why
>>> not before?

>> Hm, would there be any real use-case?

> Compatibility... apparently people do use it.

Uh, compatibility with what?  Not the SQL spec, for sure.

>> An implementation issue is that even in the "long" numeric format,
>> we cram dscale into a 14-bit unsigned field.

> That doesn't sound too appealing I guess, but couldn't you enforce it
> as a typemod without changing the on-disk representation of the
> values?

On second thought, I'm confusing two distinct though related concepts.
dscale is *display* scale, and it's fine that it's unsigned, because
there is no reason to suppress printing digits to the left of the decimal
point.  ("Whaddya mean, 10 is really 100?")  We could allow the "scale"
part of typmod to be negative and thereby cause an input of, say,
123.45 to be rounded to say 100 --- but it should display as 100 not 1,
so its display scale is still 0.

Hence, there's no pg_upgrade issue.  You'd still need to rethink how
precision and scale get packed into an int32 typmod, but those only
exist in catalog data, so pg_upgrade's schema dump/restore would be
enough to update them.
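
For illustration, one conceivable packing (an assumption, not how numeric
currently defines its typmod for negative scales) would keep 16 bits per
part and sign-extend the scale on extraction:

#include "postgres.h"

/* Illustrative only: pack precision and a possibly-negative scale. */
static int32
make_numeric_typmod(int precision, int scale)
{
	/* keep only the low 16 bits of scale so a negative value round-trips */
	return ((precision << 16) | (scale & 0xffff)) + VARHDRSZ;
}

static int
numeric_typmod_scale(int32 typmod)
{
	/* sign-extend the low 16 bits to recover a negative scale */
	return (int16) ((typmod - VARHDRSZ) & 0xffff);
}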

Having said that, we've long resisted redefining the encoding of
typmod for other data types (despite the clear insanity of some
of the encodings), for fear that client code might be looking at
those catalog columns.  I'm not sure how big a deal that really is.

regards, tom lane




Re: Definitional issue: stddev_pop (and related) for 1 input

2020-06-12 Thread Tom Lane
I wrote:
> Before v12, stddev_pop() had the following behavior with just a
> single input value:
> ...
> As of v12, though, all three cases produce 0.  I am not sure what
> to think about that with respect to an infinity input, but I'm
> quite sure I don't like it for NaN input.

While I'm still not sure whether there's an academic argument that
zero is a reasonable stddev value for a single input that is Inf,
it seems to me that backwards compatibility is a sufficient reason
for going back to producing NaN for that.

Hence, attached are some proposed patches.  0001 just adds test
cases demonstrating the current behavior; then 0002 makes the
proposed code change.  It's easy to check that the test case results
after 0002 match what v11 produces.

0003 deals with a different problem which I noted in [1]: the numeric
variants of var_samp and stddev_samp also do the wrong thing for a
single special input.  Their disease is that they produce NaN for a
single NaN input, where it seems more sensible to produce NULL.
At least, NULL is what we get for the same case with the float
aggregates, so we have to change one or the other set of functions
if we want consistency.

I propose back-patching 0001/0002 as far as v12, since the failure
to match the old outputs seems like a pretty clear bug/regression.
However, I'd be content to apply 0003 only to HEAD.  That misbehavior
is very ancient, and the lack of complaints suggests that few people
really care about this fine point.

regards, tom lane

[1] https://www.postgresql.org/message-id/606717.1591924582%40sss.pgh.pa.us

diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index d659013e41..0a6884d382 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -127,7 +127,79 @@ SELECT var_samp(b::numeric) FROM aggtest;
 
 -- population variance is defined for a single tuple, sample variance
 -- is not
-SELECT var_pop(1.0), var_samp(2.0);
+SELECT var_pop(1.0::float8), var_samp(2.0::float8);
+ var_pop | var_samp 
+-+--
+   0 | 
+(1 row)
+
+SELECT stddev_pop(3.0::float8), stddev_samp(4.0::float8);
+ stddev_pop | stddev_samp 
++-
+  0 |
+(1 row)
+
+SELECT var_pop('inf'::float8), var_samp('inf'::float8);
+ var_pop | var_samp 
+-+--
+   0 | 
+(1 row)
+
+SELECT stddev_pop('inf'::float8), stddev_samp('inf'::float8);
+ stddev_pop | stddev_samp 
++-
+  0 |
+(1 row)
+
+SELECT var_pop('nan'::float8), var_samp('nan'::float8);
+ var_pop | var_samp 
+-+--
+   0 | 
+(1 row)
+
+SELECT stddev_pop('nan'::float8), stddev_samp('nan'::float8);
+ stddev_pop | stddev_samp 
++-
+  0 |
+(1 row)
+
+SELECT var_pop(1.0::float4), var_samp(2.0::float4);
+ var_pop | var_samp 
+-+--
+   0 | 
+(1 row)
+
+SELECT stddev_pop(3.0::float4), stddev_samp(4.0::float4);
+ stddev_pop | stddev_samp 
++-
+  0 |
+(1 row)
+
+SELECT var_pop('inf'::float4), var_samp('inf'::float4);
+ var_pop | var_samp 
+-+--
+   0 | 
+(1 row)
+
+SELECT stddev_pop('inf'::float4), stddev_samp('inf'::float4);
+ stddev_pop | stddev_samp 
++-
+  0 |
+(1 row)
+
+SELECT var_pop('nan'::float4), var_samp('nan'::float4);
+ var_pop | var_samp 
+-+--
+   0 | 
+(1 row)
+
+SELECT stddev_pop('nan'::float4), stddev_samp('nan'::float4);
+ stddev_pop | stddev_samp 
++-
+  0 |
+(1 row)
+
+SELECT var_pop(1.0::numeric), var_samp(2.0::numeric);
  var_pop | var_samp 
 -+--
0 | 
@@ -139,6 +211,18 @@ SELECT stddev_pop(3.0::numeric), stddev_samp(4.0::numeric);
   0 |
 (1 row)
 
+SELECT var_pop('nan'::numeric), var_samp('nan'::numeric);
+ var_pop | var_samp 
+-+--
+ NaN |  NaN
+(1 row)
+
+SELECT stddev_pop('nan'::numeric), stddev_samp('nan'::numeric);
+ stddev_pop | stddev_samp 
++-
+NaN | NaN
+(1 row)
+
 -- verify correct results for null and NaN inputs
 select sum(null::int4) from generate_series(1,3);
  sum 
@@ -299,6 +383,25 @@ SELECT corr(b, a) FROM aggtest;
  0.139634516517873
 (1 row)
 
+-- check single-tuple behavior
+SELECT covar_pop(1::float8,2::float8), covar_samp(3::float8,4::float8);
+ covar_pop | covar_samp 
+---+
+ 0 |   
+(1 row)
+
+SELECT covar_pop(1::float8,'inf'::float8), covar_samp(3::float8,'inf'::float8);
+ covar_pop | covar_samp 
+---+
+ 0 |   
+(1 row)
+
+SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
+ covar_pop | covar_samp 
+---+
+ 0 |   
+(1 row)

Re: Infinities in type numeric

2020-06-12 Thread Robert Haas
On Fri, Jun 12, 2020 at 2:14 PM Tom Lane  wrote:
> Robert Haas  writes:
> > BTW, has there been any thought to supporting a negative scale for the
> > numeric data type? If you can cut off digits after the decimal, why
> > not before?
>
> Hm, would there be any real use-case?

Compatibility... apparently people do use it.

> An implementation issue is that even in the "long" numeric format,
> we cram dscale into a 14-bit unsigned field.  You could redefine
> the field as signed and pray that nobody has dscales above 8K
> stored on disk, but I'm dubious that there's a good argument for
> taking that risk.

That doesn't sound too appealing I guess, but couldn't you enforce it
as a typemod without changing the on-disk representation of the
values?


-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: what can go in root.crt ?

2020-06-12 Thread Bruce Momjian
On Wed, Jun  3, 2020 at 07:57:16PM -0400, Chapman Flack wrote:
> For example, we might agree that it is safe to trust nothing but the
> end-entity cert of my server itself. I made a server, here is its cert,
> here is a root.crt file for libpq containing only this exact cert, I
> want libpq to connect only ever to this server with this cert and nothing
> else. It's a pain because I have to roll out new root.crt files to everybody
> whenever the cert changes, but it would be hard to call it unsafe.

I think you have hit on the reason CAs are used.  By putting a valid
root certificate on the client, the server certificate can be changed
without modifying the certificate on the client.

Without that ability, every client would need to be changed as soon as the
server certificate was changed.  Allowing intermediate certificates to
function as root certificates would fix that problem.  When the
non-trusted CA changes your certificate, you are going to have the same
problem updating everything at once.  This is why a root certificate,
which never changes, is helpful.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: doc examples for pghandler

2020-06-12 Thread Tom Lane
Mark Wong  writes:
> Would some additional procedure language handler code examples in the
> documentation be good to add?  I've put some together in the attached
> patch, and can log it to a future commitfest if people think it would
> be a good addition.

Hmm.  The existing doc examples are really pretty laughable, because
there's such a large gap between the offered skeleton and a workable
handler.  So I agree it'd be nice to do better, but I'm suspicious of
having large chunks of sample code in the docs --- that's a maintenance
problem, if only because we likely won't notice when we break it.
Also, if somebody is hoping to copy-and-paste such code, it isn't
that nice to work from if it's embedded in SGML.

I wonder if it'd be possible to adapt what you have here into some
tiny contrib module that doesn't do very much useful, but can at
least be tested to see that it compiles; moreover it could be
copied verbatim to serve as a starting point for a new PL.
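
(For reference, the bare minimum such a skeleton needs is roughly the
following; this is an illustration, not what the docs or Mark's repo
contain, and a real handler would look up and execute the function's
source text rather than doing nothing.)

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(plsample_call_handler);

/* A do-nothing call handler: every PL function "returns" NULL. */
Datum
plsample_call_handler(PG_FUNCTION_ARGS)
{
	PG_RETURN_NULL();
}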

regards, tom lane




Re: Building PostgreSQL extensions on Windows

2020-06-12 Thread Dmitry Igrishin
Hi David,

No pain with CMake. It's pretty easy to use on Windows for PostgreSQL
extensions. For example: https://github.com/dmitigr/pgnso


On Fri, 12 Jun 2020, 01:43 David Rowley,  wrote:

> Hi,
>
> I've heard from a few people that building PostgreSQL extensions on
> Windows is a bit of a pain. I've heard from these people that their
> solution was to temporarily add their extension as a contrib module
> and have the extension building code take care of creating and
> building the Visual Studio project file.
>
> I also have to say, I do often use Visual Studio myself for PostgreSQL
> development, but when it comes to testing something with an extension,
> I've always avoided the problem and moved over to Linux.
>
> I thought about how we might improve this situation.  The easiest way
> I could think to do this was to just reuse the code that builds the
> Visual Studio project files for contrib modules and write a Perl
> script which calls those functions. Now, these functions, for those
> who have never looked there before, they do use the PGXS compatible
> Makefile as a source of truth and build the VS project file from that.
> I've attached a very rough PoC patch which attempts to do this.
>
> The script just takes the directory of the Makefile as the first
> argument, and optionally the path to pg_config.exe as the 2nd
> argument.  If that happens to be in the PATH environment variable then
> that can be left out.
>
> You end up with:
>
> X:\pg_src\src\tools\msvc>perl extbuild.pl
> X:\pg_src\contrib\auto_explain X:\pg\bin
> Makefile dir = X:\pg_src\contrib\auto_explain
> Postgres include dir = X:\pg\include
> Building = Release
> Detected hardware platform: x64
>
> ...
>
> Build succeeded.
> 0 Warning(s)
> 0 Error(s)
>
> Time Elapsed 00:00:01.13
>
> For now, I've only tested this on a few contrib modules. It does need
> more work to properly build ones with a list of "MODULES" in the
> Makefile. It seems to work ok on the MODULE_big ones that I tested. It
> needs a bit more work to get the resource file paths working properly
> for PROGRAM.
>
> Before I go and invest more time in this, I'd like to get community
> feedback about the idea. Is this something that we'd want? Does it
> seem maintainable enough to have in core?  Is there a better way to do
> it?
>
> David
>
>


Re: Serializable wrong?

2020-06-12 Thread Pantelis Theodosiou
On Fri, Jun 12, 2020 at 6:58 PM Joshua Drake  wrote:

> -Hackers,
>
> I came across this today [1], "
> 3 Results
>
> In most respects, PostgreSQL behaved as expected: both read uncommitted
> and read committed prevent write skew and aborted reads. We observed no
> internal consistency violations. However, we have two surprising results to
> report. The first is that PostgreSQL’s “repeatable read” is weaker than
> repeatable read, at least as defined by Berenson, Adya, Bailis, et al. This
> is not necessarily wrong: the ANSI SQL standard is ambiguous. The second
> result, which is definitely wrong, is that PostgreSQL’s “serializable”
> isolation level isn’t serializable: it allows G2-item during normal
> operation. "
>
> Thanks!
>
> JD
>
> 1. https://jepsen.io/analyses/postgresql-12.3
>

Yes, this has been reported and is under discussion in pgsql-bugs list:

https://www.postgresql.org/message-id/db7b729d-0226-d162-a126-8a8ab2dc4443%40jepsen.io


Postgresql13_beta1 (could not rename temporary statistics file) Windows 64bits

2020-06-12 Thread Ranier Vilela
Postgres 13 beta1 is consistently writing "could not rename temporary
statistics file" to the logs.
When analyzing the source that writes this log message, I simplified that
part a little:

1. I changed the chain of "if / else if" into independent "if" statements.
2. For the user, it is better to have more errors recorded, which can help
in discovering the problem.
3. The errors are independent of each other.
4. If I can't release tmpfile, there's no way to delete (unlink) it.
5. If the rename succeeds, there is no need to delete (unlink) tmpfile.

Attached is the patch that proposes these changes.

Now, the underlying problem is still not solved.
1. Is statfile really closed, or does it simply not exist in the directory?
   There is no way to rename a file that is open and in use.
2. Why unlink pgstat_stat_filename if permanent is true?
   const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
   Here statfile is PGSTAT_STAT_PERMANENT_FILENAME, not pgstat_stat_filename.

regards,
Ranier Vilela


001_simplifies_write_statsfiles.patch
Description: Binary data


Re: how to create index concurrently on partitioned table

2020-06-12 Thread Justin Pryzby
On Fri, Jun 12, 2020 at 04:17:34PM +0800, 李杰(慎追) wrote:
> As we all know, CIC uses three transactions. If we recurse into N
> partitioned tables, it becomes 3N transactions. If an error occurs in
> these transactions, we have too many things to deal with...
> 
> If an error occurs when an index is created in one of the partitions, 
> what should we do with our new index?

My (tentative) understanding is that these types of things should use a
"subtransaction" internally.  So if the toplevel transaction rolls back, its
changes are lost.  In some cases, it might be desirable not to roll back, in
which case the user (client) should first create indexes (concurrently if
needed) on every child, and then later create the index on the parent (that
has the advantage of working on older servers, too).

postgres=# SET client_min_messages=debug;
postgres=# CREATE INDEX ON t(i);
DEBUG:  building index "t1_i_idx" on table "t1" with request for 1 parallel 
worker
DEBUG:  index "t1_i_idx" can safely use deduplication
DEBUG:  creating and filling new WAL file
DEBUG:  done creating and filling new WAL file
DEBUG:  creating and filling new WAL file
DEBUG:  done creating and filling new WAL file
DEBUG:  building index "t2_i_idx" on table "t2" with request for 1 parallel 
worker
^C2020-06-12 13:08:17.001 CDT [19291] ERROR:  canceling statement due to user 
request
2020-06-12 13:08:17.001 CDT [19291] STATEMENT:  CREATE INDEX ON t(i);
2020-06-12 13:08:17.001 CDT [27410] FATAL:  terminating connection due to 
administrator command
2020-06-12 13:08:17.001 CDT [27410] STATEMENT:  CREATE INDEX ON t(i);
Cancel request sent

If the index creation is interrupted at this point, no indexes will exist.

On Fri, Jun 12, 2020 at 04:06:28PM +0800, 李杰(慎追) wrote:
> >On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
> > I looked at CIC now and came up with the attached.  All that's needed to 
> > allow
> > this case is to close the relation before recursing to partitions - it 
> > needs to
> > be closed before calling CommitTransactionCommand().  There's probably a 
> > better
> > way to write this, but I can't see that there's anything complicated about
> > handling partitioned tables.
> 
> I'm so sorry about getting back late.
> Thank you very much for helping me consider this issue.
> I compiled the patch v1 you provided, and applied v2-001 to PostgreSQL
> again.
> I got a coredump that was easy to reproduce. As follows:

> I have been trying to get familiar with the source code of create index.
> Can you solve this bug first? I will try my best to implement CIC with you.
> Next, I will read your patches v2-002 and v2-003.

Thanks, fixed.

On Fri, Jun 12, 2020 at 04:20:17PM +0900, Michael Paquier wrote:
> When it comes to test behaviors specific to partitioning, there are in
> my experience three things to be careful about and stress in the tests:
> - Use at least two layers of partitioning.
> - Include into the partition tree a partition that has no leaf
> partitions.
> - Test the commands on the top-most parent, a member in the middle of
> the partition tree, the partition with no leaves, and one leaf, making
> sure that relfilenode changes where it should and that partition trees
> remain intact (you can use pg_partition_tree() for that.)

Added, thanks for looking.

-- 
Justin
>From 4d85c34d9d8488fa47371608a14c6f3dcea3f075 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 6 Jun 2020 17:42:23 -0500
Subject: [PATCH v3 1/3] Allow CREATE INDEX CONCURRENTLY on partitioned table

Note, this effectively reverts 050098b14, so take care to not reintroduce the
bug it fixed.
---
 doc/src/sgml/ref/create_index.sgml |  9 
 src/backend/commands/indexcmds.c   | 63 --
 src/test/regress/expected/indexing.out | 26 +--
 src/test/regress/sql/indexing.sql  | 12 +++--
 4 files changed, 71 insertions(+), 39 deletions(-)

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index ff87b2d28f..e1298b8523 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -657,15 +657,6 @@ Indexes:
 cannot.

 
-   
-Concurrent builds for indexes on partitioned tables are currently not
-supported.  However, you may concurrently build the index on each
-partition individually and then finally create the partitioned index
-non-concurrently in order to reduce the time where writes to the
-partitioned table will be locked out.  In this case, building the
-partitioned index is a metadata only operation.
-   
-
   
  
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 2baca12c5f..42c905867c 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -652,17 +652,6 @@ DefineIndex(Oid relationId,
 	partitioned = rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE;
 	if (partitioned)
 	{
-		/*
-		 * Note: we check 'stmt->concurrent' rather than 

Re: Infinities in type numeric

2020-06-12 Thread Tom Lane
Robert Haas  writes:
> BTW, has there been any thought to supporting a negative scale for the
> numeric data type? If you can cut off digits after the decimal, why
> not before?

Hm, would there be any real use-case?

An implementation issue is that even in the "long" numeric format,
we cram dscale into a 14-bit unsigned field.  You could redefine
the field as signed and pray that nobody has dscales above 8K
stored on disk, but I'm dubious that there's a good argument for
taking that risk.

There might be algorithmic issues as well, haven't really looked.
Any such problems would probably be soluble, if need be by forcing
the scale to be at least 0 for calculation and then rounding
afterwards.
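
For illustration, if we did allow it, the natural semantics would be
rounding to the left of the decimal point, along the lines of (hypothetical,
not accepted today):

SELECT CAST(123456.78 AS numeric(5,-3));
-- would presumably round to the nearest thousand, yielding 123000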

regards, tom lane




Re: Infinities in type numeric

2020-06-12 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 [...]
 Tom> so that a finite value should never map to INT[64]_MIN, making it
 Tom> safe to do as you suggest. I agree that distinguishing +Inf from
 Tom> NaN is probably more useful than distinguishing it from the very
 Tom> largest class of finite values, so will do it as you suggest.
 Tom> Thanks!

It would make sense to make sure there's a test case in which at least
one value of all three of: a finite value much greater than 10^332, a
+Inf, and a NaN were all present in the same sort, if there isn't one
already.
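
Something along these lines (an untested sketch; the infinity input assumes
the patch's input syntax) would exercise all three in one sort:

SELECT x
FROM (VALUES (numeric '1e340'),     -- finite, but larger than 10^332
             (numeric 'Infinity'),  -- +Inf
             (numeric 'NaN')) AS t(x)
ORDER BY x;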

-- 
Andrew (irc:RhodiumToad)




Serializable wrong?

2020-06-12 Thread Joshua Drake
-Hackers,

I came across this today [1], "
3 Results

In most respects, PostgreSQL behaved as expected: both read uncommitted and
read committed prevent write skew and aborted reads. We observed no
internal consistency violations. However, we have two surprising results to
report. The first is that PostgreSQL’s “repeatable read” is weaker than
repeatable read, at least as defined by Berenson, Adya, Bailis, et al. This
is not necessarily wrong: the ANSI SQL standard is ambiguous. The second
result, which is definitely wrong, is that PostgreSQL’s “serializable”
isolation level isn’t serializable: it allows G2-item during normal
operation. "

Thanks!

JD

1. https://jepsen.io/analyses/postgresql-12.3


Re: Parallel Seq Scan vs kernel read ahead

2020-06-12 Thread Robert Haas
On Thu, Jun 11, 2020 at 4:54 PM David Rowley  wrote:
> I think someone at some point is not going to like the automatic
> choice. So perhaps a reloption to allow users to overwrite it is a
> good idea. -1 should most likely mean use the automatic choice based
> on relation size.  I think for parallel seq scans that filter a large
> portion of the records most likely need some sort of index, but there
> are perhaps some genuine cases for not having one. e.g perhaps the
> query is just not run often enough for an index to be worthwhile. In
> that case, the performance is likely less critical, but at least the
> reloption would allow users to get the old behaviour.

Let me play the devil's advocate here. I feel like if the step size is
limited by the relation size and there is ramp-up and ramp-down, or
maybe even if you don't have all 3 of those but perhaps say 2 of them,
the chances of there being a significant downside from using this seem
quite small. At that point I wonder whether you really need an option.
It's true that someone might not like it, but there are all sorts of
things that at least one person doesn't like and one can't cater to
all of them.

To put that another way, in what scenario do we suppose that a
reasonable person would wish to use this reloption? If there's none,
we don't need it. If there is one, can we develop a mitigation that
solves their problem automatically instead?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Infinities in type numeric

2020-06-12 Thread Robert Haas
On Fri, Jun 12, 2020 at 1:01 PM Tom Lane  wrote:
> This does tie into something I have a question about in the patch's
> comments though.  As the patch stands, numeric(numeric, integer)
> (that is, the typmod-enforcement function) just lets infinities
> through regardless of the typmod, on the grounds that it is/was also
> letting NaNs through regardless of typmod.  But you could certainly
> make the argument that Inf should only be allowed in an unconstrained
> numeric column, because by definition it overflows any finite precision
> restriction.  If we did that, you'd never see Inf in a
> standard-conforming column, since SQL doesn't allow unconstrained
> numeric columns IIRC.  That'd at least ameliorate your concern.

Yes, I agree. It also seems like a more principled choice - I am not
sure why if I ask for a number no larger than 10^3 we ought to permit
infinity.

BTW, has there been any thought to supporting a negative scale for the
numeric data type? If you can cut off digits after the decimal, why
not before?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Internal key management system

2020-06-12 Thread Bruce Momjian
On Fri, Jun 12, 2020 at 09:17:37AM +0200, Fabien COELHO wrote:
> The question is what should be put in the protocol, and I would tend to
> think that some careful design time should be put in it.

I still don't see the value of this vs. its complexity.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee





Re: WIP/PoC for parallel backup

2020-06-12 Thread Robert Haas
On Thu, Jun 11, 2020 at 1:41 PM Hamid Akhtar  wrote:
> As far I understand, parallel backup is not a mandatory performance feature, 
> rather, one at user's discretion. This IMHO indicates that it will benefit 
> some users and it may not others.
>
> IMHO, parallel backup has obvious performance benefits, but we need to ensure 
> that users understand that there is potential for slower backup if there is 
> no competition for resources.

I am sure that nobody is arguing that the patch has to be beneficial
in all cases in order to justify applying it. However, there are
several good arguments against proceeding with this patch:

* Every version of the patch that has been reviewed by anybody has
been riddled with errors. Over and over again, testers have found
serious bugs, and code reviewers have noticed lots of problems, too.

* This approach requires rewriting a lot of current functionality,
either by moving it to the client side or by restructuring it to work
with parallelism. That's a lot of work, and it seems likely to
generate more work in the future as people continue to add features.
It's one thing to add a feature that doesn't benefit everybody; it's
another thing to add a feature that doesn't benefit everybody and also
hinders future development. See
http://postgr.es/m/ca+tgmozublxyr+pd_gi3mvgyv5hqdlm-gbrvxkun-lewaw1...@mail.gmail.com
for more discussion of these issues.

* The scenarios in which the patch delivers a performance benefit are
narrow and somewhat contrived. In remote backup scenarios, AIUI, the
patch hasn't been shown to help. In local backups, it does, but how
likely is it that you are going to do your local backups over the wire
protocol instead of by direct file copy, which is probably much
faster? I agree that if your server is overloaded, having multiple
processes competing for the server resources will allow backup to get
a larger slice relative to other things, but that seems like a pretty
hackish and inefficient solution to that problem. You could also argue
that we could provide a feature to prioritize some queries over other
queries by running them with tons of parallel workers just to convince
the OS to give them more resources, and I guess that would work, but
it would also waste tons of resources and possibly cripple or even
crash your system if you used it enough. The same argument applies
here.

* Even when the patch does provide a benefit, it seems to max out at
about 2.5X. Clearly it's nice to have something go 2.5X faster, but
the point is that it doesn't scale beyond that no matter how many
workers you add. That doesn't automatically mean that something is a
bad idea, but it is a concern. At the very least, we should be able to
say why it doesn't scale any better than that.

* Actually, we have some hints about that. Over at
http://postgr.es/m/20200503174922.mfzzdafa5g4rl...@alap3.anarazel.de
Andres has shown that too much concurrency when copying files results
in a dramatic performance reduction, and that a lot of the reason why
concurrency helps in the first place has to do with the fact that
pg_basebackup does not have any cache control (no fallocate,
sync_file_range(WRITE), posix_fadvise(DONTNEED)). When those things
are added the performance gets better and the benefits of concurrency
are reduced. I suspect that would also be true for this patch. It
would be unreasonable to commit a large patch, especially one that
would hinder future development, if we could get the same benefits
from a small patch that would not do so.

I am not in a position to tell you how to spend your time, so you can
certainly pursue this patch if you wish. However, I think it's
probably not the best use of time. Even if you fixed all the bugs and
reimplemented all of the functionality that needs reimplementing in
order to make this approach work, it still doesn't make sense to
commit the patch if either (a) we can obtain the same benefit, or most
of it, from a much simpler patch or (b) the patch is going to make it
significantly harder to develop other features that we want to have,
especially if those features seem likely to be more beneficial than
what this patch offers. I think both of those are likely true here.

For an example of (b), consider compression of tar files on the server
side before transmission to the client. If you take the approach this
patch does and move tarfile construction to the client, that is
impossible. Now you can argue (and perhaps you will) that this would
just mean someone has to choose between using this feature and using
that feature, and why should users not have such a choice? That is a
fair argument, but my counter-argument is that users shouldn't be
forced into making that choice. If the parallel feature is beneficial
enough to justify having it, then it ought to be designed in such a
way that it works with the other features we also want to have rather
than forcing users to choose between them. Since I have already
proposed (on the other thread 

doc examples for pghandler

2020-06-12 Thread Mark Wong
Hi everyone,

Would some additional procedure language handler code examples in the
documentation be good to add?  I've put some together in the attached
patch, and can log it to a future commitfest if people think it would
be a good addition.

Regards,
Mark
-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/
diff --git a/doc/src/sgml/plhandler.sgml b/doc/src/sgml/plhandler.sgml
index e1b0af7a60..0287d424cb 100644
--- a/doc/src/sgml/plhandler.sgml
+++ b/doc/src/sgml/plhandler.sgml
@@ -241,4 +241,560 @@ CREATE LANGUAGE plsample
 reference page also has some useful details.

 
+   
+The following subsections contain additional examples to help build a
+complete procedural language handler.
+   
+
+   
+Minimal Example
+
+
+ Here is a complete minimal example that builds a procedural language
+ handler PL/Sample as an extension.  Functions
+ can be created and called for PL/Sample but
+ they will not be able to perform any useful actions.
+
+
+
+ The plsample--0.1.sql file:
+
+CREATE FUNCTION plsample_call_handler()
+RETURNS language_handler
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE LANGUAGE plsample
+HANDLER plsample_call_handler;
+
+COMMENT ON LANGUAGE plsample IS 'PL/Sample procedural language';
+
+
+
+
+ The control file plsample.control looks like this:
+
+comment = 'PL/Sample'
+default_version = '0.1'
+module_pathname = '$libdir/plsample'
+relocatable = false
+schema = pg_catalog
+superuser = false
+
+ See  for more information about writing
+ control files.
+
+
+
+ The following Makefile relies on
+ PGXS.
+
+PGFILEDESC = "PL/Sample - procedural language"
+
+EXTENSION = plsample
+EXTVERSION = 0.1
+
+MODULE_big = plsample
+
+OBJS = plsample.o
+
+DATA = plsample.control plsample--0.1.sql
+
+plsample.o: plsample.c
+
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+
+ See  for more information on makefiles with
+ PGXS.
+
+
+
+ Here is the minimal C code in plsample.c that will
+ handle calls to this sample procedural language:
+
+#include "postgres.h"
+#include "fmgr.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(plsample_call_handler);
+
+/*
+ * Handle function, procedure, and trigger calls.
+ */
+Datum
+plsample_call_handler(PG_FUNCTION_ARGS)
+{
+return 0;
+}
+
+
+
+
+ The following sections will continue building upon this example to
+ describe how to add additional functionality to a call handler for a
+ procedural language.
+
+   
+
+   
+Fetching the source of a function
+
+
+ Additional code is added to plsample.c from the minimal example to
+ include the headers needed by the code that fetches the source text of
+ the function.  The resulting file now looks like:
+
+#include "postgres.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "access/htup_details.h"
+#include "catalog/pg_proc.h"
+#include "catalog/pg_type.h"
+#include "utils/memutils.h"
+#include "utils/builtins.h"
+#include "utils/syscache.h"
+
+MemoryContext TopMemoryContext = NULL;
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(plsample_call_handler);
+
+/*
+ * Handle function, procedure, and trigger calls.
+ */
+
+Datum
+plsample_call_handler(PG_FUNCTION_ARGS)
+{
+HeapTuple pl_tuple;
+Datum pl_datum;
+const char *source;
+bool isnull;
+
+/* Fetch the source of the function. */
+
+pl_tuple = SearchSysCache(PROCOID,
+ObjectIdGetDatum(fcinfo->flinfo->fn_oid), 0, 0, 0);
+if (!HeapTupleIsValid(pl_tuple))
+elog(ERROR, "cache lookup failed for function %u",
+fcinfo->flinfo->fn_oid);
+
+pl_datum = SysCacheGetAttr(PROCOID, pl_tuple, Anum_pg_proc_prosrc,
+&isnull);
+if (isnull)
+elog(ERROR, "null prosrc");
+ReleaseSysCache(pl_tuple);
+
+source = DatumGetCString(DirectFunctionCall1(textout, pl_datum));
+elog(LOG, "source text:\n%s", source);
+
+return 0;
+}
+
+ The variable source contains the source text that
+ needs to be interpreted and executed by the procedural language handler
+ itself, or by an existing implementation of the programming language that
+ the source text is written in.
+
+
+
+ The following CREATE FUNCTION will set the source text
+ of the function to This is function's source text.:
+
+CREATE FUNCTION plsample_func()
+RETURNS VOID
+AS $$
+  This is function's source text.
+$$ LANGUAGE plsample;
+
+
+
+
+ Calling the function with the following command will log the function's
+ source text because of the elog() statement towards
+ the end of the plsample_call_handler() function:
+
+SELECT plsample_func();
+
+ and emits the following to the log file:
+
+LOG:  source text:
+
+  This is function's source text.
+
+
+
+   
+
+   
+Analyzing function 

Re: Infinities in type numeric

2020-06-12 Thread Tom Lane
Robert Haas  writes:
> On Thu, Jun 11, 2020 at 9:16 PM Tom Lane  wrote:
>> We had a discussion recently about how it'd be a good idea to support
>> infinity values in type numeric [1].

> FWIW, I don't particularly like the idea. Back when I was an
> application developer, I remember having to insert special cases into
> any code that dealt with double precision to deal with +/-Inf and NaN.
> I was happy that I didn't need them for numeric, too. So this change
> would have made me sad.

Well, you're already stuck with special-casing numeric NaN, so I'm
not sure that Inf makes your life noticeably worse on that score.

This does tie into something I have a question about in the patch's
comments though.  As the patch stands, numeric(numeric, integer)
(that is, the typmod-enforcement function) just lets infinities
through regardless of the typmod, on the grounds that it is/was also
letting NaNs through regardless of typmod.  But you could certainly
make the argument that Inf should only be allowed in an unconstrained
numeric column, because by definition it overflows any finite precision
restriction.  If we did that, you'd never see Inf in a
standard-conforming column, since SQL doesn't allow unconstrained
numeric columns IIRC.  That'd at least ameliorate your concern.

If we were designing this today, I think I'd vote to disallow NaN
in a constrained numeric column, too.  But I suppose it's far too
late to change that aspect.

regards, tom lane




Re: Infinities in type numeric

2020-06-12 Thread Robert Haas
On Thu, Jun 11, 2020 at 9:16 PM Tom Lane  wrote:
> We had a discussion recently about how it'd be a good idea to support
> infinity values in type numeric [1].

Only a minority of that discussion was actually on that topic, and I'm
not sure there was a clear consensus in favor of it.

FWIW, I don't particularly like the idea. Back when I was an
application developer, I remember having to insert special cases into
any code that dealt with double precision to deal with +/-Inf and NaN.
I was happy that I didn't need them for numeric, too. So this change
would have made me sad.

It's possible I'm the only one, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Wrong width of UNION statement

2020-06-12 Thread Kenichiro Tanaka
Hello, Ashutosh
Thank you for your response.

>We already have that infrastructure, IIUC through commit
That's good!

>Can we use that to fix this bug?
I'll try it.
But this is my first hack, so I think I need time.

Regards
Kenichiro Tanaka

2020年6月8日(月) 22:11 Ashutosh Bapat :
>
> On Mon, Jun 1, 2020 at 8:35 PM Tom Lane  wrote:
> >
> > Kenichiro Tanaka  writes:
> > > I think the table column width of a UNION statement should be equal to
> > > that of a UNION ALL.
> >
> > I don't buy that argument, because there could be type coercions involved,
> > so that the result width isn't necessarily equal to any one of the inputs.
> >
> > Having said that, the example you give shows that we make use of
> > pg_statistic.stawidth values when estimating the width of immediate
> > relation outputs, but that data isn't available by the time we get to
> > a UNION output.  So we fall back to get_typavgwidth, which in this
> > case is going to produce something involving the typmod times the
> > maximum encoding length.  (I suppose you're using UTF8 encoding...)
> >
> > There's room for improvement there, but this is all bound up in the legacy
> > mess that we have in prepunion.c.  For example, because we don't have
> > RelOptInfo nodes associated with individual set-operation outputs,
>
> We already have that infrastructure, IIUC through commit
>
> commit c596fadbfe20ff50a8e5f4bc4b4ff5b7c302ecc0
> Author: Robert Haas 
> Date:   Mon Mar 19 11:55:38 2018 -0400
>
> Generate a separate upper relation for each stage of setop planning.
>
> Can we use that to fix this bug?
>
> --
> Best Wishes,
> Ashutosh Bapat




Review for GetWALAvailability()

2020-06-12 Thread Fujii Masao

Hi,

The document explains that the "lost" value reported by
pg_replication_slots.wal_status means

some WAL files are definitely lost and this slot cannot be used to resume 
replication anymore.

However, I observed "lost" value while inserting lots of records,
but replication could continue normally. So I wonder if
pg_replication_slots.wal_status may have a bug.
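
The "lost" value can be observed with a query like:

SELECT slot_name, restart_lsn, wal_status FROM pg_replication_slots;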

wal_status is calculated in GetWALAvailability(), and I think I found
some issues in it.


keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
                          wal_segment_size) + 1;

max_wal_size_mb is a number of megabytes, while wal_keep_segments is
a number of WAL segment files, so it's strange to take the max of them.
Shouldn't the above instead be the following?

Max(ConvertToXSegs(max_wal_size_mb, wal_segment_size), wal_keep_segments) + 1



if ((max_slot_wal_keep_size_mb <= 0 ||
     max_slot_wal_keep_size_mb >= max_wal_size_mb) &&
    oldestSegMaxWalSize <= targetSeg)
        return WALAVAIL_NORMAL;

This code means that wal_status reports "normal" only when
max_slot_wal_keep_size is negative or larger than max_wal_size.
Why is this condition necessary? The document explains that "normal"
means that the claimed files are within max_wal_size. So whatever
the max_slot_wal_keep_size value is, IMO "normal" should be
reported if the WAL files claimed by the slot are within max_wal_size.
Thoughts?

Or, if that condition is really necessary, the document should be
updated so that the note about the condition is added.



If the WAL files claimed by the slot exceed max_slot_wal_keep_size
but those WAL files have not actually been removed yet, wal_status seems
to report "lost". Is this expected behavior? Per the meaning of "lost"
described in the document, "lost" should be reported only when
some claimed files have actually been removed, I think. Thoughts?

Or this behavior is expected and the document is incorrect?



BTW, if we want to implement GetWALAvailability() as the document
advertises, we can simplify it like the attached POC patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 55cac186dc..0b9cca2173 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9504,62 +9504,29 @@ GetWALAvailability(XLogRecPtr targetLSN)
XLogSegNo   currSeg;/* segid of currpos */
XLogSegNo   targetSeg;  /* segid of targetLSN */
XLogSegNo   oldestSeg;  /* actual oldest segid */
-   XLogSegNo   oldestSegMaxWalSize;/* oldest segid kept by 
max_wal_size */
-   XLogSegNo   oldestSlotSeg = InvalidXLogRecPtr;  /* oldest segid 
kept by
-   
 * slot */
uint64  keepSegs;
 
/* slot does not reserve WAL. Either deactivated, or has never been 
active */
if (XLogRecPtrIsInvalid(targetLSN))
return WALAVAIL_INVALID_LSN;
 
-   currpos = GetXLogWriteRecPtr();
-
/* calculate oldest segment currently needed by slots */
XLByteToSeg(targetLSN, targetSeg, wal_segment_size);
-   KeepLogSeg(currpos, );
 
-   /*
-* Find the oldest extant segment file. We get 1 until checkpoint 
removes
-* the first WAL segment file since startup, which causes the status 
being
-* wrong under certain abnormal conditions but that doesn't actually 
harm.
-*/
-   oldestSeg = XLogGetLastRemovedSegno() + 1;
+   /* Find the oldest extant segment file */
+   oldestSeg = XLogGetLastRemovedSegno();
 
-   /* calculate oldest segment by max_wal_size and wal_keep_segments */
+   if (targetSeg <= oldestSeg)
+   return WALAVAIL_REMOVED;
+
+   currpos = GetXLogWriteRecPtr();
XLByteToSeg(currpos, currSeg, wal_segment_size);
-   keepSegs = ConvertToXSegs(Max(max_wal_size_mb, wal_keep_segments),
- wal_segment_size) + 1;
+   keepSegs = ConvertToXSegs(max_wal_size_mb, wal_segment_size);
 
-   if (currSeg > keepSegs)
-   oldestSegMaxWalSize = currSeg - keepSegs;
-   else
-   oldestSegMaxWalSize = 1;
-
-   /*
-* If max_slot_wal_keep_size has changed after the last call, the 
segment
-* that would been kept by the current setting might have been lost by 
the
-* previous setting. No point in showing normal or keeping status values
-* if the targetSeg is known to be lost.
-*/
-   if (targetSeg >= oldestSeg)
-   {
-   /*
-* show "normal" when targetSeg is within max_wal_size, even if
-* 

Re: exp() versus the POSIX standard

2020-06-12 Thread Emre Hasegeli
> No, no kind of GUC switch is needed. Just drop underflow/overflow checks. 
> You'll get 0 or Infinity in expected places, and infinities are okay since 
> 2007.

This is out of the scope of this thread.  I am not sure opening it for
discussion on another thread would yield any result.  Experienced
developers like Tom appear to agree that we need to protect
users from the oddities of floating point numbers.  (I do not.)




Re: exp() versus the POSIX standard

2020-06-12 Thread Emre Hasegeli
> Does anyone disagree that that's a bug?  Should we back-patch
> a fix, or just change it in HEAD?  Given the lack of user
> complaints, I lean a bit towards the latter, but am not sure.

The other functions and operators take care not to raise an error
when the input is Inf or 0.  exp() and power() are at least
inconsistent in this respect.  I don't think this behavior is useful,
although it'd still be less risky to fix it in HEAD only.
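
For a concrete example of what I mean (as I understand the report upthread;
an untested sketch):

SELECT exp(float8 '-infinity');
-- POSIX says exp(-Inf) is 0, but this reportedly raises an underflow error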




Re: Internal key management system

2020-06-12 Thread Masahiko Sawada
On Fri, 12 Jun 2020 at 16:17, Fabien COELHO  wrote:
>
>
> Hello Masahiko-san,
>
> I'm not sure I understood your concern. I try to answer below.
>
> > If I understand your idea correctly we put both DEK and KEK
> > "elsewhere", and a postgres process gets only DEK from it.
>
> Yes, that is one of the option.
>
> > It seems to me this idea assumes that the place storing encryption keys
> > employs a 2-tier key hierarchy or something similar.
>
> ISTM that there is no such assumption. There is the assumption that there
> is an interface to retrieve DEK. What is done being the interface to
> retrieve this DEK should be irrelevant to pg. Having them secure by a
> KEK looks like a reasonable design, though. Maybe keys are actually
> stored. Maybe they are computed based on something, eg key identifier and
> some secret. Maybe there is indeed a 2-tier something. Maybe whatever.
>
> > What if the user wants to or has to manage a single encryption key?
>
> Then it has one key identifier and it retrieve one key from the DMS.
> Having a "management system" for a singleton looks like overkill though,
> but it should work.
>
> > For example, storing an encryption key for PostgreSQL TDE into a file in
> > a safe server instead of KMS using DEK and KEK because of budgets or
> > requirements whatever.
>
> Good. If you have an interface to retrieve a key, then it can probably
> contact said server to get it when needed?
>
> > In this case, if the user does key rotation, that encryption key would
> > need to be rotated, resulting in the user would need to re-encrypt all
> > database data encrypted with old key.
>
> Sure, by definition actually changing the key requires a
> decryption/encryption cycle on all data.
>
> > It should work but what do you think about how postgres does key
> > rotation and re-encryption?
>
> If pg actually has the DEK, then it means that while the re-encryption is
> performed it has to manage two keys simultaneously; this is a question about
> what is done on the pg server with the keys, not really about the DMS?

Yes. Your explanation made my concern clear. Thanks!

>
> If the "elsewhere" service does the encryption, maybe the protocol could
> include it, eg something like:
>
> REC key1-id key2-id data-encrypted-with-key1
>   -> data-encrypted-with-key2
>
> But it could also achieve the same thing with two commands, eg:
>
> DEC key1-id data-encrypted-with-key1
>   -> clear-text-data
>
> ENC key2-id clear-text-data
>   -> data-encrypted-with-key2
>
> The question is what should be put in the protocol, and I would tend to
> think that some careful design time should be put in it.
>

Summarizing the discussed points so far, I think that the major
advantage points of your idea comparing to the current patch's
architecture are:

* More secure. Because the KEK is never loaded into postgres processes, we can
lower the likelihood of KEK leakage.
* More extensible. We will be able to implement more protocols to
outsource other operations to the external place.

On the other hand, here are some downsides and issues:

* The external place needs to manage more encryption keys than the
current patch does. Some cloud key management services charge by
the number of active keys and key operations, so the number of keys
postgres requires affects the cost. It'd be worse if we were to
have keys per table.

* If this approach supports only GET protocol, the user needs to
create encryption keys with appropriate ids in advance so that
postgres can get keys by id. If postgres TDE creates keys as needed,
CREATE protocol would also be required.

* If we need only the GET protocol, the current approach (i.e.
cluster_passphrase_command) would be simpler. I imagine the
interface between postgres and the extension is a C function. This
approach is more extensible, but it also means extensions need to
support multiple protocols, leading to increased complexity and
development cost.

* This approach doesn't necessarily eliminate the data leakage threat
caused by process compromise. Since the DEK is placed in
postgres process memory, it's still possible that if a postgres
process is compromised the attacker can steal database data. The
benefit of lowering the possibility of KEK leakage is to deal with the
threat of the attacker seeing database data encrypted by past or
future DEKs protected by the stolen KEK.

* An open question is, as you previously mentioned, how to verify the
key obtained from the external place is the right key.

Anything else we need to note?

Finally, please understand I'm not disputing your idea, but just
trying to understand which approach is better from multiple
perspectives.

Regards,

--
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Infinities in type numeric

2020-06-12 Thread Tom Lane
Andrew Gierth  writes:
> "Tom" == Tom Lane  writes:
>  Tom> +#define NUMERIC_ABBREV_PINF   NumericAbbrevGetDatum(PG_INT64_MIN)
>  Tom> +#define NUMERIC_ABBREV_PINF   NumericAbbrevGetDatum(PG_INT32_MIN)

> I'd have been more inclined to go with -PG_INT64_MAX / -PG_INT32_MAX for
> the NUMERIC_ABBREV_PINF value. It seems more likely to be beneficial to
> bucket +Inf and NaN separately (and put +Inf in with the "too large to
> abbreviate" values) than to bucket them together so as to distinguish
> between +Inf and "too large" values. But this is an edge case in any
> event, so it probably wouldn't make a great deal of difference unless
> you're sorting on data with a large proportion of both +Inf and NaN
> values.

I had been worried about things possibly sorting in the wrong order
if I did that.  However, now that I look more closely I see that

 * We convert the absolute value of the numeric to a 31-bit or 63-bit positive
 * value, and then negate it if the original number was positive.

so that a finite value should never map to INT[64]_MIN, making it
safe to do as you suggest.  I agree that distinguishing +Inf from NaN
is probably more useful than distinguishing it from the very largest
class of finite values, so will do it as you suggest.  Thanks!

regards, tom lane




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-12 Thread Masahiko Sawada
On Fri, 12 Jun 2020 at 19:24, Amit Kapila  wrote:
>
> On Fri, Jun 12, 2020 at 2:10 PM Masahiko Sawada
>  wrote:
> >
> > On Fri, 12 Jun 2020 at 15:37, Amit Kapila  wrote:
> > >
> > > > >
> > > > > I think this is a corner case and it is better to simplify the state
> > > > > recording of foreign transactions then to save a CLOG lookup.
> > > > >
> > > >
> > > > The main usage of in-doubt flag is to distinguish between in-doubt
> > > > transactions and other transactions that have their waiter (I call
> > > > on-line transactions).
> > > >
> > >
> > > Which are these other online transactions?  I had assumed that foreign
> > > transaction resolver process is to resolve in-doubt transactions but
> > > it seems it is also used for some other purpose which anyway was the
> > > next question I had while reviewing other sections of docs but let's
> > > clarify as it came up now.
> >
> > When a distributed transaction is committed by COMMIT command, the
> > postgres backend process prepare all foreign transaction and commit
> > the local transaction.
> >

Thank you for your review comments! Let me answer your question first.
I'll see the review comments.

>
> Does this mean that we will mark the xid as committed in CLOG of the
> local server?

Well, what I meant is that when the client executes the COMMIT command, the
backend executes a PREPARE TRANSACTION command on all involved foreign
servers and then marks the xid as committed in the clog on the local
server.
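
For clarity, the flow is roughly equivalent to issuing the following
commands manually (illustrative identifiers only; the patch does all of
this internally):

-- on every involved foreign server, before the local commit:
PREPARE TRANSACTION 'fx_some_identifier';

-- then the local transaction is committed (xid marked in clog), and later
-- the resolver process runs, per foreign server:
COMMIT PREPARED 'fx_some_identifier';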

>   If so, why is this okay till we commit transactions in
> all the foreign servers, what if we fail to commit on one of the
> servers?

Once the local transaction is committed, the involved foreign
transactions are never rolled back. The backend has already prepared all
foreign transactions before the local commit, and committing a prepared
foreign transaction basically doesn't fail. But even if it fails for whatever
reason, we never roll back the prepared foreign transactions; a
resolver retries committing them at certain intervals.
Does that answer your question?

Regards,

-- 
Masahiko Sawadahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Infinities in type numeric

2020-06-12 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 Tom> @@ -359,10 +390,14 @@ typedef struct NumericSumAccum
 Tom>  #define NumericAbbrevGetDatum(X) ((Datum) (X))
 Tom>  #define DatumGetNumericAbbrev(X) ((int64) (X))
 Tom>  #define NUMERIC_ABBREV_NAN    NumericAbbrevGetDatum(PG_INT64_MIN)
 Tom> +#define NUMERIC_ABBREV_PINF   NumericAbbrevGetDatum(PG_INT64_MIN)
 Tom> +#define NUMERIC_ABBREV_NINF   NumericAbbrevGetDatum(PG_INT64_MAX)
 Tom>  #else
 Tom>  #define NumericAbbrevGetDatum(X) ((Datum) (X))
 Tom>  #define DatumGetNumericAbbrev(X) ((int32) (X))
 Tom>  #define NUMERIC_ABBREV_NAN    NumericAbbrevGetDatum(PG_INT32_MIN)
 Tom> +#define NUMERIC_ABBREV_PINF   NumericAbbrevGetDatum(PG_INT32_MIN)
 Tom> +#define NUMERIC_ABBREV_NINF   NumericAbbrevGetDatum(PG_INT32_MAX)
 Tom>  #endif

I'd have been more inclined to go with -PG_INT64_MAX / -PG_INT32_MAX for
the NUMERIC_ABBREV_PINF value. It seems more likely to be beneficial to
bucket +Inf and NaN separately (and put +Inf in with the "too large to
abbreviate" values) than to bucket them together so as to distinguish
between +Inf and "too large" values. But this is an edge case in any
event, so it probably wouldn't make a great deal of difference unless
you're sorting on data with a large proportion of both +Inf and NaN
values.

(It would be possible to add another bucket so that "too large", +Inf,
and NaN were three separate buckets, but honestly any more complexity
seems not worth it for handling an edge case.)

The actual changes to the abbrev stuff look fine to me.

-- 
Andrew (irc:RhodiumToad)




Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-12 Thread Magnus Hagander
On Fri, Jun 12, 2020 at 10:23 AM Amit Kapila 
wrote:

> On Fri, Jun 12, 2020 at 11:20 AM Masahiko Sawada
>  wrote:
> >
> > On Fri, 12 Jun 2020 at 12:21, Amit Kapila 
> wrote:
> > >
> > > > Since the logical decoding intermediate files are written at per
> slots
> > > > directory, I thought that corresponding these statistics to
> > > > replication slots is also understandable for users.
> > > >
> > >
> > > What I wanted to know is how will it help users to tune
> > > logical_decoding_work_mem?  Different backends can process from the
> > > same slot, so it is not clear how user will be able to make any
> > > meaning out of those stats.
> >
> > I thought that the user needs to constantly monitor them during one
> > process is executing logical decoding and to see the increments. I
> > might not fully understand but I guess the same is true for displaying
> > them w.r.t. process. Since a process can do logical decoding several
> > times using the same slot with a different setting, the user will need
> > to monitor them several times.
> >
>
> Yeah, I think we might not be able to get exact measure but if we
> divide total_size spilled by exec_count, we will get some rough idea
> of what should be the logical_decoding_work_mem for that particular
> session.  For ex. consider the logical_decoding_work_mem is 100bytes
> for a particular backend and the size spilled by that backend is 100
> then I think you can roughly keep it to 200bytes if you want to avoid
> spilling.  Similarly one can compute its average value over multiple
> executions.  Does this make sense to you?
>

The thing that becomes really interesting is to analyze this across time.
For example, to identify patterns where it always spills at the same time as
certain other things are happening. For that use case, having a "predictable
persistence" is important. You may not be able to afford setting
logical_decoding_work_mem high enough to cover every possible scenario (if
you did, then we would basically not need the spilling..), so you want to
track down in relation to the rest of your application exactly when and how
this is happening.


>
> > > OTOH, it is easier to see how to make
> > > meaning of these stats if we display them w.r.t process.  Basically,
> > > we have spill_count and spill_size which can be used to tune
> > > logical_decoding_work_mem and also the activity of spilling happens at
> > > process level, so it sounds like one-to-one mapping.
> >
> > Displaying them w.r.t process also seems a good idea but I'm still
> > unclear what to display and how long these values are valid.
> >
>
> I feel till the lifetime of a process if we want to display the values
> at process level but I am open to hear others (including yours) views
> on this.
>

The problem with "lifetime of a process" is that it's not predictable. A
replication process might "bounce" for any reason, and it is normally not a
problem. But if you suddenly lose your stats when you do that, it starts to
matter a lot more. Especially when you don't know if it bounced. (Sure you
can look at the backend_start time, but that adds a whole different sets of
complexitites).


> > The view

> > will have the following columns for example?
> >
> > * pid
> > * slot_name
> > * spill_txns
> > * spill_count
> > * spill_bytes
> > * exec_count
> >
>
> Yeah, these appear to be what I have in mind.  Note that we can have
> multiple entries of the same pid here because of slotname, there is
> some value to display slotname but I am not completely sure if that is
> a good idea but I am fine if you have a reason to include slotname?
>

Well, it's a general view, so you can always GROUP BY that away at reading
time if you want?
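
For example, something like this (just a sketch; the view and column names
listed above are still up for discussion):

SELECT pid,
       sum(spill_txns)  AS spill_txns,
       sum(spill_count) AS spill_count,
       sum(spill_bytes) AS spill_bytes
FROM pg_stat_replication_spill   -- placeholder name for the proposed view
GROUP BY pid;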

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Getting ERROR with FOR UPDATE/SHARE for partitioned table.

2020-06-12 Thread Ashutosh Bapat
On Wed, Jun 3, 2020 at 12:44 PM Amit Langote  wrote:
>
> On Thu, May 28, 2020 at 11:08 PM Ashutosh Bapat
>  wrote:
> > On Wed, May 27, 2020 at 6:51 PM Amit Langote  
> > wrote:
> > > So in Rajkumar's example, the cursor is declared as:
> > >
> > > CURSOR IS SELECT * FROM tbl WHERE c1< 5 FOR UPDATE;
> > >
> > > and the WHERE CURRENT OF query is this:
> > >
> > >  UPDATE tbl SET c2='aa' WHERE CURRENT OF cur;
> >
> > Thanks for the clarification. So it looks like we expand UPDATE on
> > partitioned table to UPDATE on each partition (inheritance_planner for
> > DML) and then execute each of those. If CURRENT OF were to save the
> > table oid or something we could run the UPDATE only on that partition.
>
> Are you saying that the planner should take into account the state of
> the cursor specified in WHERE CURRENT OF to determine which of the
> tables to scan for the UPDATE?  Note that neither partition pruning
> nor constraint exclusion know that CurrentOfExpr can possibly allow to
> exclude children of the UPDATE target.

Yes. But it may not be possible to know the value of CURRENT OF at
planning time since it need not be a plan-time constant. This
pruning has to happen at run time. But as Alvaro has mentioned in his
reply, for a user this is surprising behaviour and should be fixed.

-- 
Best Wishes,
Ashutosh Bapat




Re: doc review for v13

2020-06-12 Thread Michael Paquier
On Thu, Jun 11, 2020 at 09:37:09PM -0500, Justin Pryzby wrote:
> Some new bits,
> And some old ones.

I have merged 0003 and 0004 together and applied them.  0005 seems to
have a separate issue as mentioned upthread, and I have not really
looked at 0001 and 0002.  Thanks.
--
Michael


signature.asc
Description: PGP signature


Re: TAP tests and symlinks on Windows

2020-06-12 Thread Juan José Santamaría Flecha
On Fri, Jun 12, 2020 at 9:00 AM Michael Paquier  wrote:

> On Tue, Jun 09, 2020 at 11:26:19AM +0100, Dagfinn Ilmari Mannsåker wrote:
> > Plus a note in the Win32 docs that Win32::Symlink may be required to run
> > some tests on some Perl/Windows versions..
>
> Planting such a check in individual scripts is not a good idea because
> it would get forgotten.  The best way to handle that is to add a new
> check in the BEGIN block of TestLib.pm.  Note that we already do that
> with createFile, OsFHandleOpen and CloseHandle.  Now the question is:
> do we really want to make this a hard requirement?  I would like to
> answer yes so as we make sure that this gets always tested, and this
> needs proper documentation as you say.  Now it would be also possible
> to check if the API is present in the BEGIN block of TestLib.pm, and
> then use an independent variable similar to what we do with
> $use_unix_sockets to decide if tests should be skipped or not, but you
> cannot know if this gets actually, or ever, tested.


The first thing that comes to mind is adding an option to vcregress to
choose whether symlinks will be tested or skipped; would that be an
acceptable solution?

Regards,

Juan José Santamaría Flecha



compile cube extension with float4 precision storing

2020-06-12 Thread Siarhei D
Is it possible to build the cube extension with float (4-byte) precision
instead of double (8-byte) precision?

I use the cube extension for storing embedding vectors and calculating
distances on them. When comparing vectors, 4-byte float precision is enough;
storing 8-byte double precision wastes disk space.

For now, to avoid wasting disk space, I store the vectors as real[] arrays and
create cube objects on the fly, but this solution wastes CPU time.
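
Roughly like this (a sketch of the workaround; table and column names are
made up):

CREATE TABLE embeddings (id int PRIMARY KEY, vec real[]);

-- build cubes on the fly for each distance computation
SELECT id
FROM embeddings
ORDER BY cube(vec::float8[]) <-> cube(ARRAY[0.12, 0.34, 0.56]::float8[])
LIMIT 10;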

--

Siarhei Damanau


Re: Parallel copy

2020-06-12 Thread Ashutosh Sharma
Hi All,

I've spent a little bit of time going through the project discussion that has
happened in this email thread, and to start with I have a few questions which
I would like to put here:

Q1) Are we also planning to read the input data in parallel or is it only
about performing the multi-insert operation in parallel? AFAIU, the data
reading part will be done by the leader process alone so no parallelism is
involved there.

Q2) How are we going to deal with partitioned tables? I mean, will there
be a worker process dedicated to each partition, or how will it work? Further,
the challenge that I see in the case of partitioned tables is that we would
have a single input file containing data to be inserted into multiple tables
(aka partitions), unlike the normal case where all the tuples in the input
file would belong to the same table.

Q3) In the case of TOAST tables, there is a possibility of having a single
tuple in the input file which could be of a very big size (probably in GB),
eventually resulting in a bigger file size. So, in this case, how are we
going to decide the number of worker processes to be launched? I mean,
although the file size is big, the number of tuples to be processed is
just one or a few, so can we really decide the number of worker
processes to launch based on the file size?

Q4) Who is going to process constraints (particularly deferred
constraints) that are supposed to be executed at COMMIT time? I mean, is
it the leader process or the worker process, or won't we be
choosing parallelism at all in such cases?

Q5) Do we have any risk of table bloat when the data is loaded in
parallel? I am just asking this because in the case of parallelism there would
be multiple processes performing bulk inserts into a table. There is a
chance that the table file might get extended even if there is some space
in the file being written into, because that space is locked by some other
worker process, and hence that might result in the creation of a new block for
that table. Sorry if I am missing something here.

Please note that I haven't gone through all the emails in this thread so
there is a possibility that I might have repeated the question that has
already been raised and answered here. If that is the case, I am sorry for
that, but it would be very helpful if someone could point out that thread
so that I can go through it. Thank you.

-- 
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com

On Fri, Jun 12, 2020 at 11:01 AM vignesh C  wrote:

> On Thu, Jun 4, 2020 at 12:44 AM Andres Freund  wrote
> >
> >
> > Hm. you don't explicitly mention that in your design, but given how
> > small the benefits going from 0-1 workers is, I assume the leader
> > doesn't do any "chunk processing" on its own?
> >
>
> Yes you are right, the leader does not do any processing, Leader's
> work is mainly to populate the shared memory with the offset
> information for each record.
>
> >
> >
> > > Design of the Parallel Copy: The backend, to which the "COPY FROM"
> query is
> > > submitted acts as leader with the responsibility of reading data from
> the
> > > file/stdin, launching at most n number of workers as specified with
> > > PARALLEL 'n' option in the "COPY FROM" query. The leader populates the
> > > common data required for the workers execution in the DSM and shares it
> > > with the workers. The leader then executes before statement triggers if
> > > there exists any. Leader populates DSM chunks which includes the start
> > > offset and chunk size, while populating the chunks it reads as many
> blocks
> > > as required into the DSM data blocks from the file. Each block is of
> 64K
> > > size. The leader parses the data to identify a chunk, the existing
> logic
> > > from CopyReadLineText which identifies the chunks with some changes was
> > > used for this. Leader checks if a free chunk is available to copy the
> > > information, if there is no free chunk it waits till the required
> chunk is
> > > freed up by the worker and then copies the identified chunks
> information
> > > (offset & chunk size) into the DSM chunks. This process is repeated
> till
> > > the complete file is processed. Simultaneously, the workers cache the
> > > chunks(50) locally into the local memory and release the chunks to the
> > > leader for further populating. Each worker processes the chunk that it
> > > cached and inserts it into the table. The leader waits till all the
> chunks
> > > populated are processed by the workers and exits.
> >
> > Why do we need the local copy of 50 chunks? Copying memory around is far
> > from free. I don't see why it'd be better to add per-process caching,
> > rather than making the DSM bigger? I can see some benefit in marking
> > multiple chunks as being processed with one lock acquisition, but I
> > don't think adding a memory copy is a good idea.
>
> We had run performance with  csv data file, 5.1GB, 10million tuples, 2
> indexes on integer columns, results for the same are 

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-06-12 Thread Amit Kapila
On Fri, Jun 12, 2020 at 11:38 AM Dilip Kumar  wrote:
>
> - Currently, while reading/writing the streaming/subxact files we are
> reporting the wait event for example
> 'pgstat_report_wait_start(WAIT_EVENT_LOGICAL_SUBXACT_WRITE);',  but
> BufFileWrite/BufFileRead internally reports the read/write wait event.
> So I think we can avoid reporting that?
>

Yes, we can avoid that.  No other place using BufFileRead does any
such reporting.

>  Basically, I still have to work on this part; once we get consensus,
> I can remove those extra wait events from the patch.
>

Okay, feel free to send an updated patch with the above change.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-12 Thread Amit Kapila
On Fri, Jun 12, 2020 at 2:10 PM Masahiko Sawada
 wrote:
>
> On Fri, 12 Jun 2020 at 15:37, Amit Kapila  wrote:
> >
> > > >
> > > > I think this is a corner case and it is better to simplify the state
> > > > recording of foreign transactions then to save a CLOG lookup.
> > > >
> > >
> > > The main usage of in-doubt flag is to distinguish between in-doubt
> > > transactions and other transactions that have their waiter (I call
> > > on-line transactions).
> > >
> >
> > Which are these other online transactions?  I had assumed that foreign
> > transaction resolver process is to resolve in-doubt transactions but
> > it seems it is also used for some other purpose which anyway was the
> > next question I had while reviewing other sections of docs but let's
> > clarify as it came up now.
>
> When a distributed transaction is committed by COMMIT command, the
> postgres backend process prepare all foreign transaction and commit
> the local transaction.
>

Does this mean that we will mark the xid as committed in CLOG of the
local server?  If so, why is this okay till we commit transactions in
all the foreign servers, what if we fail to commit on one of the
servers?

Few more comments on v22-0003-Documentation-update
--
1.
+  When disabled there can be risk of database
+  consistency among all servers that involved in the distributed
+  transaction when some foreign server crashes during committing the
+  distributed transaction.

Will it read better if we rephrase the above to something like: "When
disabled there can be a risk of database
consistency if one or more foreign servers crashes while committing
the distributed transaction."?

2.
+  
+   foreign_transaction_resolution_retry_interval
(integer)
+
+ foreign_transaction_resolution_interval
configuration parameter
+
+   
+   
+
+ Specify how long the foreign transaction resolver should
wait when the last resolution
+ fails before retrying to resolve foreign transaction. This
parameter can only be set in the
+ postgresql.conf file or on the server
command line.
+
+
+ The default value is 10 seconds.
+
+   
+  

Typo.

3.
+   foreign_transaction_resolver_timeout
(integer)
+
+ foreign_transaction_resolver_timeout
configuration parameter
+
+   
+   
+
+ Terminate foreign transaction resolver processes that don't
have any foreign
+ transactions to resolve longer than the specified number of
milliseconds.
+ A value of zero disables the timeout mechanism, meaning it
connects to one
+ database until stopping manually.

Can we mention the name of the function that can be used to stop the resolver process?

4.
+   Using the PostgreSQL's atomic commit ensures that
+   all changes on foreign servers end in either commit or rollback using the
+   transaction callback routines

Can we slightly rephase this "Using the PostgreSQL's atomic commit
ensures that all the changes on foreign servers are either committed
or rolled back using the transaction callback routines"?

5.
+   Prepare all transactions on foreign servers.
+   PostgreSQL distributed transaction manager
+   prepares all transaction on the foreign servers if two-phase commit is
+   required. Two-phase commit is required when the transaction modifies
+   data on two or more servers including the local server itself and
+is
+   required.

/PostgreSQL/PostgreSQL's.

 If all preparations on foreign servers got
+   successful go to the next step.

How about "If the prepare on all foreign servers is successful then go
to the next step"?

 Any failure happens in this step,
+   the server changes to rollback, then rollback all transactions on both
+   local and foreign servers.

Can we rephrase this line to something like: "If there is any failure
in the prepare phase, the server will rollback all the transactions on
both local and foreign servers."?

What if the issued Rollback also fails, say due to a network breakdown
between the local server and one of the foreign servers?  Shouldn't such a
transaction be in the 'in-doubt' state?

6.
+  
+   Commit locally. The server commits transaction locally.  Any
failure happens
+   in this step the server changes to rollback, then rollback all
transactions
+   on both local and foreign servers.
+  
+ 
+ 
+  
+   Resolve all prepared transaction on foreign servers. Pprepared
transactions
+   are committed or rolled back according to the result of the
local transaction.
+   This step is normally performed by a foreign transaction
resolver process.
+  

When (in which step) do we commit on the foreign servers?  Do resolver
processes commit on the foreign servers?  If so, how can we commit locally
without committing on the foreign servers?

Re: Building PostgreSQL extensions on Windows

2020-06-12 Thread Joe Conway
On 6/11/20 6:42 PM, David Rowley wrote:
> I've heard from a few people that building PostgreSQL extensions on
> Windows is a bit of a pain. I've heard from these people that their
> solution was to temporarily add their extension as a contrib module
> and have the extension building code take care of creating and
> building the Visual Studio project file.


Yep -- that is exactly how I have been building PL/R on Windows for many years,
and it is painful.


> I thought about how we might improve this situation.  The easiest way
> I could think to do this was to just reuse the code that builds the
> Visual Studio project files for contrib modules and write a Perl
> script which calls those functions. Now, these functions, for those
> who have never looked there before, they do use the PGXS compatible
> Makefile as a source of truth and build the VS project file from that.
> I've attached a very rough PoC patch which attempts to do this.
> 
> The script just takes the directory of the Makefile as the first
> argument, and optionally the path to pg_config.exe as the 2nd
> argument.  If that happens to be in the PATH environment variable then
> that can be left out.
> 
> You end up with:
> 
> X:\pg_src\src\tools\msvc>perl extbuild.pl
> X:\pg_src\contrib\auto_explain X:\pg\bin
> Makefile dir = X:\pg_src\contrib\auto_explain
> Postgres include dir = X:\pg\include
> Building = Release
> Detected hardware platform: x64
> 
> ...
> 
> Build succeeded.
> 0 Warning(s)
> 0 Error(s)
> 
> Time Elapsed 00:00:01.13
> 
> For now, I've only tested this on a few contrib modules. It does need
> more work to properly build ones with a list of "MODULES" in the
> Makefile. It seems to work ok on the MODULE_big ones that I tested. It
> needs a bit more work to get the resource file paths working properly
> for PROGRAM.
> 
> Before I go and invest more time in this, I'd like to get community
> feedback about the idea. Is this something that we'd want? Does it
> seem maintainable enough to have in core?  Is there a better way to do
> it?


Sounds very useful to me -- I'll give it a try with PL/R this weekend.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: POC and rebased patch for CSN based snapshots

2020-06-12 Thread movead...@highgo.ca
Hello hackers,

I have made some changes based on the last version:
1. Catch up to the current commit (c2bd1fec32ab54).
2. Add regression tests and documentation.
3. Add support for switching from an xid-based snapshot to a csn-based snapshot,
and the same on the standby side.



Regards,
Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca





0001-CSN-base-snapshot.patch
Description: Binary data


0002-Wal-for-csn.patch
Description: Binary data


0003-snapshot-switch.patch
Description: Binary data


Re: Add tap test for --extra-float-digits option

2020-06-12 Thread Dong Wook Lee
Oh, now I understand, and I added a test for the --rows-per-insert option.
I'd better look for more options that are missing tests.

On Fri, Jun 12, 2020 at 4:04 PM Michael Paquier wrote:

> On Fri, Jun 12, 2020 at 02:30:35PM +0900, Dong Wook Lee wrote:
> > Thank you for your response
> > Do you mean to move it under the test of --compression option?
>
> You could move the test where you see is best, and I would have done
> that.  My point is that we could have a test also for
> --rows-per-insert as it deals with the same problem.
> --
> Michael
>


0002-pg_dump-Add-test-for-rows-per-insert-option.patch
Description: Binary data


Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-12 Thread Masahiko Sawada
On Fri, 12 Jun 2020 at 15:37, Amit Kapila  wrote:
>
> On Fri, Jun 12, 2020 at 9:54 AM Masahiko Sawada
>  wrote:
> >
> > On Fri, 12 Jun 2020 at 12:40, Amit Kapila  wrote:
> > >
> > > On Fri, Jun 12, 2020 at 7:59 AM Masahiko Sawada
> > >  wrote:
> > > >
> > > > On Thu, 11 Jun 2020 at 22:21, Amit Kapila  
> > > > wrote:
> > > > >
> > > >
> > > > >  I have another question about
> > > > > this field, why can't it be one of the status ('preparing',
> > > > > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a
> > > > > separate field?
> > > >
> > > > Because I'm using in-doubt field also for checking if the foreign
> > > > transaction entry can also be resolved manually, i.g.
> > > > pg_resolve_foreign_xact(). For instance, a foreign transaction which
> > > > status = 'prepared' and in-doubt = 'true' can be resolved either
> > > > foreign transaction resolver or pg_resolve_foreign_xact(). When a user
> > > > execute pg_resolve_foreign_xact() against the foreign transaction, it
> > > > sets status = 'committing' (or 'rollbacking') by checking transaction
> > > > status in clog. The user might cancel pg_resolve_foreign_xact() during
> > > > resolution. In this case, the foreign transaction is still status =
> > > > 'committing' and in-doubt = 'true'. Then if a foreign transaction
> > > > resolver process processes the foreign transaction, it can commit it
> > > > without clog looking.
> > > >
> > >
> > > I think this is a corner case and it is better to simplify the state
> > > recording of foreign transactions than to save a CLOG lookup.
> > >
> >
> > The main usage of in-doubt flag is to distinguish between in-doubt
> > transactions and other transactions that have their waiter (I call
> > on-line transactions).
> >
>
> Which are these other online transactions?  I had assumed that foreign
> transaction resolver process is to resolve in-doubt transactions but
> it seems it is also used for some other purpose which anyway was the
> next question I had while reviewing other sections of docs but let's
> clarify as it came up now.

When a distributed transaction is committed by the COMMIT command, the
postgres backend process prepares all foreign transactions and commits
the local transaction. Then the backend enqueues itself to the shmem
queue, asks a resolver process to commit the prepared foreign
transactions, and waits. That is, these prepared foreign transactions
are committed by the resolver process, not the backend process. Once the
resolver process has committed all prepared foreign transactions, it
wakes the waiting backend process. This is the kind of transaction I
meant by on-line transactions. This procedure is similar to what
synchronous replication does.
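
For reference, what this boils down to on each foreign server is PostgreSQL's
existing two-phase commit commands; a rough sketch with made-up object names
and GID follows (the patch drives this through the FDW transaction callback
routines rather than manual SQL):

-- On the foreign server, during the local COMMIT (requires
-- max_prepared_transactions > 0 on that server):
BEGIN;
INSERT INTO ft_accounts VALUES (1, 100);   -- made-up table
PREPARE TRANSACTION 'fx_1234';             -- made-up GID

-- Later, a foreign transaction resolver process (or a manual
-- pg_resolve_foreign_xact() call) finishes it according to the
-- local transaction's outcome:
COMMIT PREPARED 'fx_1234';
-- or: ROLLBACK PREPARED 'fx_1234';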

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Resetting spilled txn statistics in pg_stat_replication

2020-06-12 Thread Amit Kapila
On Fri, Jun 12, 2020 at 11:20 AM Masahiko Sawada
 wrote:
>
> On Fri, 12 Jun 2020 at 12:21, Amit Kapila  wrote:
> >
> > > Since the logical decoding intermediate files are written at per slots
> > > directory, I thought that corresponding these statistics to
> > > replication slots is also understandable for users.
> > >
> >
> > What I wanted to know is how will it help users to tune
> > logical_decoding_work_mem?  Different backends can process from the
> > same slot, so it is not clear how user will be able to make any
> > meaning out of those stats.
>
> I thought that the user needs to constantly monitor them during one
> process is executing logical decoding and to see the increments. I
> might not fully understand but I guess the same is true for displaying
> them w.r.t. process. Since a process can do logical decoding several
> times using the same slot with a different setting, the user will need
> to monitor them several times.
>

Yeah, I think we might not be able to get an exact measure, but if we
divide the total size spilled by exec_count, we will get some rough idea
of what logical_decoding_work_mem should be for that particular
session.  For example, if logical_decoding_work_mem is 100 bytes
for a particular backend and the size spilled by that backend is 100,
then I think you can roughly set it to 200 bytes if you want to avoid
spilling.  Similarly, one can compute its average value over multiple
executions.  Does this make sense to you?
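
To make that concrete, here is a rough sketch of the kind of query a user
could run against such a per-process view (the view name below and the
column names are hypothetical, following the columns being discussed;
nothing is exposed yet):

-- Hypothetical view name; spill_bytes, spill_count and exec_count as discussed.
SELECT pid,
       slot_name,
       spill_bytes,
       exec_count,
       pg_size_pretty((spill_bytes / NULLIF(exec_count, 0))::bigint)
           AS avg_spill_per_exec   -- rough hint for logical_decoding_work_mem
FROM pg_stat_logical_decoding
WHERE spill_count > 0;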

> > OTOH, it is easier to see how to make
> > meaning of these stats if we display them w.r.t process.  Basically,
> > we have spill_count and spill_size which can be used to tune
> > logical_decoding_work_mem and also the activity of spilling happens at
> > process level, so it sounds like one-to-one mapping.
>
> Displaying them w.r.t process also seems a good idea but I'm still
> unclear what to display and how long these values are valid.
>

I feel they are valid for the lifetime of a process if we want to display
the values at the process level, but I am open to hearing others'
(including your) views on this.

> The view
> will have the following columns for example?
>
> * pid
> * slot_name
> * spill_txns
> * spill_count
> * spill_bytes
> * exec_count
>

Yeah, these appear to be what I have in mind.  Note that we can have
multiple entries of the same pid here because of slotname, there is
some value to display slotname but I am not completely sure if that is
a good idea but I am fine if you have a reason to include slotname?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Corruption during WAL replay

2020-06-12 Thread Masahiko Sawada
On Wed, 15 Apr 2020 at 04:04, Teja Mupparti  wrote:
>
> Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on
> the critical-section around truncate, but I just want to emphasize the need
> for reversing the order of dropping the buffers and the truncation.
>
>  Repro details (when full page write = off)
>
>  1) Page on disk has empty LP 1, Insert into page LP 1
>  2) checkpoint START (Recovery REDO eventually starts here)
>  3) Delete all rows on the page (page is empty now)
>  4) Autovacuum kicks in and truncates the pages
>  DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on 
> disk still empty
>  5) Checkpoint completes
>  6) Crash
>  7) smgrtruncate - Not reached (this is where we do the physical 
> truncate)
>
>  Now the crash-recovery starts
>
>  Delete-log-replay (above step-3) reads page with empty LP 1 and the 
> delete fails with PANIC (old page on disk with no insert)
>

I agree that when replaying the deletion in step (3) the page's LP 1 is
empty, but does that replay really fail with a PANIC? I guess that we
record that page into invalid_page_tab but don't raise a PANIC in this
case.

Regards,

-- 
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: how to create index concurrently on partitioned table

2020-06-12 Thread 李杰(慎追)
Hi Justin,

> Maybe I'm wrong, but I don't think there's any known difficulty - just that> 
> nobody did it yet.  You should pay attention to what happens on error, but
> hopefully you wouldn't need to add much code and can rely on existing code to
> paths to handle that right.

Yes, I am trying to learn the code for index definition.

> I think you'd look at the commits and code implementing indexes on partitioned
> tables and CREATE INDEX CONCURRENTLY.  And maybe any following commits with
> fixes.

> You'd first loop around all children (recursively if there are partitions 
> which
> are themselves partitioned) and create indexes concurrently. 

As we all know, CIC uses three transactions. If we recurse into n
partitions, it becomes 3N transactions. If an error occurs in any of these
transactions, we have too many things to deal with...

If an error occurs when an index is created in one of the partitions, 
what should we do with our new index?
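
For context, the manual procedure that works today (per the existing
documentation) looks roughly like the sketch below, with made-up names;
with it, a failure leaves only that partition's invalid index to drop and
rebuild before attaching it:

-- Parent index on the partitioned table only; it stays invalid until an
-- index has been attached for every partition.
CREATE INDEX parent_a_idx ON ONLY parent_tbl (a);

-- Build each partition's index concurrently, one at a time.
CREATE INDEX CONCURRENTLY child1_a_idx ON child1 (a);
CREATE INDEX CONCURRENTLY child2_a_idx ON child2 (a);

-- Attach them; once every partition is attached, the parent index
-- becomes valid.
ALTER INDEX parent_a_idx ATTACH PARTITION child1_a_idx;
ALTER INDEX parent_a_idx ATTACH PARTITION child2_a_idx;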


Thank you very much,
 Regards,  Adger








Re: how to create index concurrently on partitioned table

2020-06-12 Thread 李杰(慎追)

>On Sat, Jun 06, 2020 at 09:23:32AM -0500, Justin Pryzby wrote:
> > On Wed, Jun 03, 2020 at 08:22:29PM +0800, 李杰(慎追) wrote:
> > > Partitioning is necessary for very large tables.
> > >  However, I found that postgresql does not support create index 
> > > concurrently on partitioned tables.
> > > The document show that we need to create an index on each partition 
> > > individually and then finally create the partitioned index 
> > > non-concurrently. 
> > > This is undoubtedly a complex operation for DBA, especially when there 
> > > are many partitions. 
> > 
> > > Therefore, I wonder why pg does not support concurrent index creation on 
> > > partitioned tables? 
> > > What are the difficulties of this function? 
> > > If I want to implement it, what should I pay attention?
> > 
> > Maybe I'm wrong, but I don't think there's any known difficulty - just that
> > nobody did it yet.

> I said that but I was actually thinking about the code for "REINDEX
> CONCURRENTLY" (which should also handle partitioned tables).

> I looked at CIC now and came up with the attached.  All that's needed to allow
> this case is to close the relation before recursing to partitions - it needs 
> to
> be closed before calling CommitTransactionCommand().  There's probably a 
> better
> way to write this, but I can't see that there's anything complicated about
> handling partitioned tables.

Hi, Justin Pryzby

I'm so sorry for getting back late.
Thank you very much for helping me consider this issue.
I compiled the patch v1 you provided, and then applied patch v2-001 to
postgresql again.
I got a coredump that was easy to reproduce, as follows:

#0  PopActiveSnapshot () at snapmgr.c:822
#1  0x005ca687 in DefineIndex (relationId=relationId@entry=16400, 
stmt=stmt@entry=0x1aa5e28, indexRelationId=16408, indexRelationId@entry=0, 
parentIndexId=parentIndexId@entry=16406,
  parentConstraintId=0, is_alter_table=is_alter_table@entry=false,
 check_rights=true, check_not_in_use=true, skip_build=false, quiet=false) 
at indexcmds.c:1426
#2  0x005ca5ab in DefineIndex (relationId=relationId@entry=16384, 
stmt=stmt@entry=0x1b35278, indexRelationId=16406, indexRelationId@entry=0,
 parentIndexId=parentIndexId@entry=0,
  parentConstraintId=parentConstraintId@entry=0, is_alter_table=
is_alter_table@entry=false, check_rights=true, check_not_in_use=true,
 skip_build=false, quiet=false) at indexcmds.c:1329
#3  0x0076bf80 in ProcessUtilitySlow (pstate=pstate@entry=0x1b350c8, 
pstmt=pstmt@entry=0x1a2bf40,
  queryString=queryString@entry=0x1a2b2c8 "create index CONCURRENTLY 
idxpart_a_idx on idxpart (a);", context=context@entry=PROCESS_UTILITY_TOPLEVEL, 
params=params@entry=0x0,
  queryEnv=queryEnv@entry=0x0, qc=0x7ffc86cc7630, dest=0x1a2c200) at 
utility.c:1474
#4  0x0076afeb in standard_ProcessUtility (pstmt=0x1a2bf40, 
queryString=0x1a2b2c8 "create index CONCURRENTLY idxpart_a_idx on idxpart (a);",
 context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
  queryEnv=0x0, dest=0x1a2c200, qc=0x7ffc86cc7630) at utility.c:1069
#5  0x00768992 in PortalRunUtility (portal=0x1a8d1f8, pstmt=0x1a2bf40, 
isTopLevel=, setHoldSnapshot=, dest=,
 qc=0x7ffc86cc7630) at pquery.c:1157
#6  0x007693f3 in PortalRunMulti (portal=portal@entry=0x1a8d1f8, 
isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, 
dest=dest@entry=0x1a2c200,
  altdest=altdest@entry=0x1a2c200, qc=qc@entry=0x7ffc86cc7630) at pquery.c:1310
#7  0x00769ed3 in PortalRun (portal=portal@entry=0x1a8d1f8, count=count
@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once
@entry=true, dest=dest@entry=0x1a2c200,
  altdest=altdest@entry=0x1a2c200, qc=0x7ffc86cc7630) at pquery.c:779
#8  0x00765b06 in exec_simple_query (query_string=0x1a2b2c8 "create 
index CONCURRENTLY idxpart_a_idx on idxpart (a);") at postgres.c:1239
#9  0x00767de5 in PostgresMain (argc=, argv=argv@entry
=0x1a552c8, dbname=, username=) at postgres.c:4315
#10 0x006f2b23 in BackendRun (port=0x1a4d1e0, port=0x1a4d1e0) at 
postmaster.c:4523
#11 BackendStartup (port=0x1a4d1e0) at postmaster.c:4215
#12 ServerLoop () at postmaster.c:1727
#13 0x006f3a1f in PostmasterMain (argc=argc@entry=3, 
argv=argv@entry=0x1a25ea0) 
at postmaster.c:1400
#14 0x004857f9 in main (argc=3, argv=0x1a25ea0) at main.c:210


You can reproduce it like this:
```
create table idxpart (a int, b int, c text) partition by range (a);
create table idxpart1 partition of idxpart for values from (0) to (10);
create table idxpart2 partition of idxpart for values from (10) to (20);
create index CONCURRENTLY idxpart_a_idx on idxpart (a);
```

I have been trying to get familiar with the source code of CREATE INDEX.
Can you fix this bug first? I will try my best to implement CIC with you.
Next, I will read your patches v2-002 and v2-003.

Thank you very much,
 Regards,  Adger





Re: doc review for v13

2020-06-12 Thread Michael Paquier
On Thu, Jun 11, 2020 at 09:37:09PM -0500, Justin Pryzby wrote:
> Some new bits,
> And some old ones.

I was looking at this patch set, and 0005 has attracted my attention
here:

> --- a/src/backend/utils/cache/relcache.c
> +++ b/src/backend/utils/cache/relcache.c
> @@ -4240,7 +4240,6 @@ AttrDefaultFetch(Relation relation)
>   HeapTuple   htup;
>   Datum   val;
>   boolisnull;
> - int found;
>   int i;

Since 16828d5, this variable is indeed unused.  Now, the same commit
has removed the following code:
-   if (found != ndef)
-   elog(WARNING, "%d attrdef record(s) missing for rel %s",
-ndef - found, RelationGetRelationName(relation));

Should we actually keep this variable and have this sanity check in
place?  It seems to me that it would be good to have that, so that we
can make sure that the number of default attributes cached matches
the number of defaults actually found when scanning each
attribute.  Adding Andrew in CC as the author of 16828d5 for more
input.
--
Michael




Re: remove some STATUS_* symbols

2020-06-12 Thread Michael Paquier
On Thu, Jun 11, 2020 at 03:55:59PM +0200, Peter Eisentraut wrote:
> On 2020-01-16 13:56, Robert Haas wrote:
>> IMHO, custom enums for each particular case would be a big improvement
>> over supposedly-generic STATUS codes. It makes it clearer which values
>> are possible in each code path, and it comes out nicer in the
>> debugger, too.
> 
> Given this feedback, I would like to re-propose the original patch, attached
> again here.
> 
> After this, the use of the remaining STATUS_* symbols will be contained to
> the frontend and backend libpq code, so it'll be more coherent.

I am still in a so-so state regarding this patch, but I find the
debugger argument a good one.  And please don't consider me as a
blocker.

> Add a separate enum for use in the locking APIs, which were the only
> user.

> +typedef enum
> +{
> + PROC_WAIT_STATUS_OK,
> + PROC_WAIT_STATUS_WAITING,
> + PROC_WAIT_STATUS_ERROR,
> +} ProcWaitStatus;

ProcWaitStatus, and more particularly PROC_WAIT_STATUS_WAITING are
strange names (the latter refers to "wait" twice).  What do you think
about renaming the enum to ProcStatus and the flags to PROC_STATUS_*?
--
Michael




Re: Infinities in type numeric

2020-06-12 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 Tom> * I'm only about 50% sure that I understand what the sort
 Tom> abbreviation code is doing. A quick look from Peter or some other
 Tom> expert would be helpful.

That code was originally mine, so I'll look at it.

-- 
Andrew (irc:RhodiumToad)




Re: Make more use of RELKIND_HAS_STORAGE()

2020-06-12 Thread Michael Paquier
On Fri, Jun 12, 2020 at 09:16:04AM +0200, Peter Eisentraut wrote:
> committed

Yeah!
--
Michael




Re: how to create index concurrently on partitioned table

2020-06-12 Thread Michael Paquier
On Thu, Jun 11, 2020 at 10:35:02AM -0500, Justin Pryzby wrote:
> Note, you could do this now using psql like:
> SELECT format('CREATE INDEX CONCURRENTLY ... ON %s(col)', a::regclass) FROM 
> pg_partition_ancestors() AS a;
> \gexec

I have skimmed quickly through the patch set, and something has caught
my attention.

>  drop table idxpart;
> --- Some unsupported features
> +-- CIC on partitioned table
>  create table idxpart (a int, b int, c text) partition by range (a);
>  create table idxpart1 partition of idxpart for values from (0) to (10);
>  create index concurrently on idxpart (a);
> -ERROR:  cannot create index on partitioned table "idxpart" concurrently
> +\d idxpart1

When it comes to test behaviors specific to partitioning, there are in
my experience three things to be careful about and stress in the tests:
- Use at least two layers of partitioning.
- Include into the partition tree a partition that has no leaf
partitions.
- Test the commands on the top-most parent, a member in the middle of
the partition tree, the partition with no leaves, and one leaf, making
sure that relfilenode changes where it should and that partition trees
remain intact (you can use pg_partition_tree() for that.)

That's to say that the amount of regression tests added here is not
sufficient in my opinion.
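
For illustration, a minimal setup covering those cases could look like the
sketch below (names are made up); pg_partition_tree() can then be used to
check that the index tree mirrors the table tree at every level:

-- Two levels of partitioning, plus a partitioned child with no leaves.
CREATE TABLE parted (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE parted_1 PARTITION OF parted
  FOR VALUES FROM (0) TO (100) PARTITION BY RANGE (a);
CREATE TABLE parted_1_1 PARTITION OF parted_1
  FOR VALUES FROM (0) TO (50);
CREATE TABLE parted_2 PARTITION OF parted
  FOR VALUES FROM (100) TO (200) PARTITION BY RANGE (a);  -- no leaf partitions

-- With the proposed patch applied:
CREATE INDEX CONCURRENTLY parted_a_idx ON parted (a);

-- The index partition tree should mirror the table partition tree.
SELECT relid, parentrelid, isleaf, level
  FROM pg_partition_tree('parted_a_idx');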
--
Michael




Re: Internal key management system

2020-06-12 Thread Fabien COELHO



Hello Masahiko-san,

I'm not sure I understood your concern. I try to answer below.


If I understand your idea correctly we put both DEK and KEK
"elsewhere", and a postgres process gets only DEK from it.


Yes, that is one of the options.

It seems to me this idea assumes that the place storing encryption keys 
employs a 2-tier key hierarchy or something similar.


ISTM that there is no such assumption. There is the assumption that there 
is an interface to retrieve the DEK. What is done behind the interface to 
retrieve this DEK should be irrelevant to pg. Having them secured by a 
KEK looks like a reasonable design, though. Maybe keys are actually 
stored. Maybe they are computed based on something, eg key identifier and 
some secret. Maybe there is indeed a 2-tier something. Maybe whatever.



What if the user wants to or has to manage a single encryption key?


Then it has one key identifier and it retrieves one key from the DMS. 
Having a "management system" for a singleton looks like overkill, though, 
but it should work.


For example, storing an encryption key for PostgreSQL TDE into a file in 
a safe server instead of KMS using DEK and KEK because of budgets or 
requirements whatever.


Good. If you have an interface to retrieve a key, then it can probably 
contact said server to get it when needed?



In this case, if the user does key rotation, that encryption key would
need to be rotated, meaning the user would need to re-encrypt all
database data encrypted with the old key.


Sure, by definition actually changing the key requires a 
decryption/encryption cycle on all data.


It should work but what do you think about how postgres does key 
rotation and re-encryption?


If pg actually has the DEK, then it means that while the re-encryption is 
performed it has to manage two keys simultaneously; this is a question about 
what is done on the pg server with the keys, not really about the DMS?


If the "elsewhere" service does the encryption, maybe the protocol could 
include it, eg something like:


REC key1-id key2-id data-encrypted-with-key1
 -> data-encrypted-with-key2

But it could also achieve the same thing with two commands, eg:

DEC key1-id data-encrypted-with-key1
 -> clear-text-data

ENC key2-id clear-text-data
 -> data-encrypted-with-key2

The question is what should be put in the protocol, and I would tend to 
think that some careful design time should be put in it.


--
Fabien.




Re: Make more use of RELKIND_HAS_STORAGE()

2020-06-12 Thread Peter Eisentraut

On 2020-06-05 18:05, Tom Lane wrote:

Peter Eisentraut  writes:

This is a patch to make use of RELKIND_HAS_STORAGE() where appropriate,
instead of listing out the relkinds individually.  No behavior change is
intended.
This was originally part of the patch from [0], but it seems worth
moving forward independently.


Passes eyeball examination.  I did not try to search for other places
where RELKIND_HAS_STORAGE should be used.


committed

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: pg_upgrade fails with non-standard ACL

2020-06-12 Thread Michael Paquier
On Thu, Jun 11, 2020 at 07:58:43PM +0300, Anastasia Lubennikova wrote:
> I would be glad to add some test, but it seems to me that the infrastructure
> changes for cross-version pg_upgrade test is much more complicated task than
> this modest bugfix.  Besides, I've read the discussion and it seems that
> Michael
> is not going to continue this work.

The main issue I recall from this patch series was the lack of
enthusiasm because it would break potential users running major
upgrade tests based on test.sh.  At the same time, if you don't break
the wheel..

> Attached v10 patch contains more fix for uninitialized variable.

Thanks.  Sorry for the time it takes.  I'd like to get into this issue
but I have not been able to dive into it seriously yet.
--
Michael




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-06-12 Thread Michael Paquier
On Thu, Jun 11, 2020 at 10:16:50AM -0400, Stephen Frost wrote:
> Uh, doesn't this pull these functions out of libpgcommon, where they
> might have been being used by utilities outside of our client tools?

This stuff is new to 13, and we are still in beta so it is fine in my
opinion to still change things how we think is best.  Suggesting to
change things once we are officially GA would not be possible.
--
Michael




Re: TAP tests and symlinks on Windows

2020-06-12 Thread Michael Paquier
On Tue, Jun 09, 2020 at 11:26:19AM +0100, Dagfinn Ilmari Mannsåker wrote:
> Amusingly, Win32::Symlink uses a copy of our pgsymlink(), which emulates
> symlinks via junction points:
> 
> https://metacpan.org/source/AUDREYT/Win32-Symlink-0.06/pgsymlink.c

Oh, interesting point.  Thanks for the reference!

> A portable way of using symlinks if possible would be:
> 
> # In a BEGIN block because it overrides CORE::GLOBAL::symlink, which
> # only takes effect on code that's compiled after the override is
> # installed.  We don't care if it fails, since it works without on
> # some Windows perls.
> [...]
> 
> Plus a note in the Win32 docs that Win32::Symlink may be required to run
> some tests on some Perl/Windows versions..

Planting such a check in individual scripts is not a good idea because
it would get forgotten.  The best way to handle that is to add a new
check in the BEGIN block of TestLib.pm.  Note that we already do that
with createFile, OsFHandleOpen and CloseHandle.  Now the question is:
do we really want to make this a hard requirement?  I would like to
answer yes so as we make sure that this gets always tested, and this
needs proper documentation as you say.  Now it would be also possible
to check if the API is present in the BEGIN block of TestLib.pm, and
then use an independent variable similar to what we do with
$use_unix_sockets to decide if tests should be skipped or not, but you
cannot know if this gets actually, or ever, tested.
--
Michael




Re: Transactions involving multiple postgres foreign servers, take 2

2020-06-12 Thread Amit Kapila
On Fri, Jun 12, 2020 at 9:54 AM Masahiko Sawada
 wrote:
>
> On Fri, 12 Jun 2020 at 12:40, Amit Kapila  wrote:
> >
> > On Fri, Jun 12, 2020 at 7:59 AM Masahiko Sawada
> >  wrote:
> > >
> > > On Thu, 11 Jun 2020 at 22:21, Amit Kapila  wrote:
> > > >
> > >
> > > >  I have another question about
> > > > this field, why can't it be one of the status ('preparing',
> > > > 'prepared', 'committing', 'aborting', 'in-doubt') rather than having a
> > > > separate field?
> > >
> > > Because I'm using in-doubt field also for checking if the foreign
> > > transaction entry can also be resolved manually, i.g.
> > > pg_resolve_foreign_xact(). For instance, a foreign transaction which
> > > status = 'prepared' and in-doubt = 'true' can be resolved either
> > > foreign transaction resolver or pg_resolve_foreign_xact(). When a user
> > > execute pg_resolve_foreign_xact() against the foreign transaction, it
> > > sets status = 'committing' (or 'rollbacking') by checking transaction
> > > status in clog. The user might cancel pg_resolve_foreign_xact() during
> > > resolution. In this case, the foreign transaction is still status =
> > > 'committing' and in-doubt = 'true'. Then if a foreign transaction
> > > resolver process processes the foreign transaction, it can commit it
> > > without clog looking.
> > >
> >
> > I think this is a corner case and it is better to simplify the state
> > recording of foreign transactions than to save a CLOG lookup.
> >
>
> The main usage of in-doubt flag is to distinguish between in-doubt
> transactions and other transactions that have their waiter (I call
> on-line transactions).
>

Which are these other online transactions?  I had assumed that the foreign
transaction resolver process is there to resolve in-doubt transactions, but
it seems it is also used for some other purpose, which anyway was the
next question I had while reviewing other sections of the docs, but let's
clarify it as it came up now.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: new heapcheck contrib module

2020-06-12 Thread Dilip Kumar
On Fri, Jun 12, 2020 at 12:40 AM Mark Dilger
 wrote:
>
>
>
> > On Jun 11, 2020, at 9:14 AM, Dilip Kumar  wrote:
> >
> > I have just browsed through the patch and the idea is quite
> > interesting.  I think we can expand it to check that whether the flags
> > set in the infomask are sane or not w.r.t other flags and xid status.
> > Some examples are
> >
> > - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED
> > should not be set in new_infomask2.
> > - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we
> > actually cross verify the transaction status from the CLOG and check
> > whether is matching the hint bit or not.
> >
> > While browsing through the code I could not find that we are doing
> > this kind of check,  ignore if we are already checking this.
>
> Thanks for taking a look!
>
> Having both of those bits set simultaneously appears to fall into a different 
> category than what I wrote verify_heapam.c to detect.

Ok

> It doesn't violate any assertion in the backend, nor does it cause
> the code to crash.  (At least, I don't immediately see how it does
> either of those things.)  At first glance it appears invalid to have
> those bits both set simultaneously, but I'm hesitant to enforce that
> without good reason.  If it is a good thing to enforce, should we also
> change the backend code to Assert?

Yeah, it may not hit an assert or crash, but it could lead to a wrong
result.  But I agree that it could be an assertion in the backend
code.  What about the other check, where the hint bit says the
transaction is committed but as per the clog the status is actually
something else?  I think in general processing it is hard to check
such things in the backend, no?  Because if the hint bit is set saying
that the transaction is committed, then we will directly check its
visibility with the snapshot.  I think a corruption checker may be a
good tool for catching such anomalies.
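
As a side note, for poking at such cases by hand, the pageinspect extension
can already show the raw hint bits; a small sketch (the table name is made
up, and the mask values 0x0100/0x0400 are HEAP_XMIN_COMMITTED and
HEAP_XMAX_COMMITTED from htup_details.h):

CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Dump line pointers and hint bits for block 0 of a table.
SELECT lp,
       t_xmin,
       t_xmax,
       t_infomask,
       (t_infomask & 256)  <> 0 AS xmin_committed_hint,  -- 0x0100
       (t_infomask & 1024) <> 0 AS xmax_committed_hint   -- 0x0400
  FROM heap_page_items(get_raw_page('some_table', 0));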

> I integrated your idea into one of the regression tests.  It now sets these 
> two bits in the header of one of the rows in a table.  The verify_heapam 
> check output (which includes all detected corruptions) does not change, which 
> verifies your observation that verifies _heapam is not checking for this.  
> I've attached that as a patch to this email.  Note that this patch should be 
> applied atop the v6 patch recently posted in another email.

Ok.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




RE: BUG #16481: Stored Procedure Triggered by Logical Replication is Unable to use Notification Events

2020-06-12 Thread Vianello Fabio
Just to help those who find themselves in the same situation, there is a simple 
application-level workaround which consists of listening, on the replica side, 
to the notifications issued by the master, and vice versa.

We are programming in .NET and we use code like this:

Code on the replica side:

var csb = new NpgsqlConnectionStringBuilder
{
    Host = "master",
    Database = "MasterDB",
    Port = 5432,
    Username = "postgres",
    Password = "XXX",
};
var connection = new NpgsqlConnection(csb.ConnectionString);
connection.Open();
using (var command = new NpgsqlCommand("listen \"Table\"", connection))
{
    command.ExecuteNonQuery();
}
connection.Notification += PostgresNotification;

So from the replica side you can listen for every change raised by a trigger on 
the master on the table "Table".

-- The function must exist before the trigger that calls it.
CREATE FUNCTION public.master_notice()
RETURNS trigger
LANGUAGE 'plpgsql'
COST 100
VOLATILE NOT LEAKPROOF
AS $BODY$
  BEGIN
  PERFORM pg_notify('Table', CAST(NEW."ID" AS text));
  RETURN NEW;
  END;
  $BODY$;

ALTER FUNCTION public.master_notice()
OWNER TO postgres;

CREATE TRIGGER master_trigger
AFTER INSERT OR DELETE OR UPDATE
ON public."TABLE"
FOR EACH ROW
EXECUTE PROCEDURE public.master_notice();

ALTER TABLE public."TABLE"
ENABLE ALWAYS TRIGGER master_trigger;

I hope that helps someone, because the bug has lasted for years. I tried 
versions 10, 11 and 12, so it has been present since 2017-10-05, and I can't 
see any solution in the 13 beta.

Best Regards.
Fabio.


-Original Message-
From: Vianello Fabio
Sent: lunedì 8 giugno 2020 11:14
To: Kyotaro Horiguchi 
Cc: pgsql-b...@lists.postgresql.org; pgsql-hackers@lists.postgresql.org
Subject: RE: BUG #16481: Stored Procedure Triggered by Logical Replication is 
Unable to use Notification Events

Hi Kyotaro Horiguchi, thanks for your help.
We have a question about the bug: why isn't there any solution in HEAD?

This bug has existed since version 10.4 and I can't understand why it is not 
fixed in HEAD yet.

BR.
Fabio Vianello.


-Original Message-
From: Kyotaro Horiguchi [mailto:horikyota@gmail.com]
Sent: lunedì 8 giugno 2020 10:28
To: Vianello Fabio ; 
pgsql-b...@lists.postgresql.org; pgsql-hackers@lists.postgresql.org
Subject: Re: BUG #16481: Stored Procedure Triggered by Logical Replication is 
Unable to use Notification Events

Hello.

It seems to me a bug.

At Fri, 05 Jun 2020 11:05:14 +, PG Bug reporting form 
 wrote in
> The following bug has been logged on the website:
>
> Bug reference:  16481
> Logged by:  Fabio Vianello
> Email address:  fabio.viane...@salvagninigroup.com
> PostgreSQL version: 12.3
> Operating system:   Windows 10
> Description:
>
> About the bug BUG #15293, on PostgreSQL version 10.4 and 11.2 as
> describe below, we found the same issue on the PostgreSQL version 12.3.

The HEAD behaves the same way.

> Is it a feature?
> Becasue in the documentation we didn't found any constraint that says
> that we can not use NOTIFY/LISTEN on logical replication tables.
>
> "When using logical replication a stored procedure executed on the
> replica is unable to use NOTIFY to send messages to other listeners.
> The stored procedure does execute as expected but the pg_notify()
> doesn't appear to have any effect. If an insert is run on the replica
> side the trigger executes the stored procedure as expected and the
> NOTIFY correctly notifies listeners.

The message is actually queued, but the logical replication worker doesn't 
signal that to listener backends. If any ordinary session sent a message to the 
same listener after that, both messages would be shown at once.

That can be fixed by calling ProcessCompletedNotifies() in apply_handle_commit. 
The function has code to write out notifications to connected clients, but it 
does nothing on logical replication workers.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Internal key management system

2020-06-12 Thread Masahiko Sawada
On Thu, 11 Jun 2020 at 20:03, Fabien COELHO  wrote:
>
>
> Hello Masahiko-san,
>
> >> If the KEK is ever present in pg process, it presumes that the threat
> >> model being addressed allows its loss if the process is compromised, i.e.
> >> all (past, present, future) security properties are void once the process
> >> is compromised.
> >
> > Why we should not put KEK in pg process but it's okay for other
> > processes?
>
> My point is "elsewhere".
>
> Indeed, it could be on another process on the same host, in which case I'd
> rather have the process run under a different uid, which means another
> compromission would be required if pg is compromissed locally ; it could
> also be in a process on another host ; it could be on some special
> hardware. Your imagination is the limit.
>
> > I guess you're talking about a threat when a malicious user logged in OS
> > (or at least accessible) but I thought there is no difference between pg
> > process and other processes in terms of the process being compromised.
>
> Processes are isolated based on uid, unless root is compromised. Once a id
> is compromised (eg "postgres"), the hacker has basically access to all
> files and processes accessible to that id.
>
> > So the solution, in that case, would be to outsource
> > encryption/decryption to external servers as Bruce mentioned.
>
> Hosting stuff (keys, encryption) on another server is indeed an option if
> "elsewhere" is implemented.

If I understand your idea correctly we put both DEK and KEK
"elsewhere", and a postgres process gets only DEK from it. It seems to
me this idea assumes that the place storing encryption keys employs a
2-tier key hierarchy or something similar. What if the user wants to or
has to manage a single encryption key? For example, storing an
encryption key for PostgreSQL TDE into a file in a safe server instead
of KMS using DEK and KEK because of budgets or requirements whatever.
In this case, if the user does key rotation, that encryption key would
need to be rotated, meaning the user would need to re-encrypt all
database data encrypted with the old key. It should work but what do you
think about how postgres does key rotation and re-encryption?

Regards,

--
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services