Re: Physical replication slot advance is not persistent

2020-01-29 Thread Kyotaro Horiguchi
At Wed, 29 Jan 2020 15:45:56 +0900, Michael Paquier  wrote 
in 
> On Tue, Jan 28, 2020 at 06:06:06PM +0300, Alexey Kondratov wrote:
> > On 28.01.2020 15:14, Kyotaro Horiguchi wrote:
> >> But the doc part looks a bit too detailed to me. Couldn't we explain
> >> that without the word 'dirty'?
..
> >> and it will not be moved beyond the current insert location. Returns
> >> the name of the slot and the real position to which it was advanced. The
> >> information of the updated slot is scheduled to be written out at the
> >> follow-up checkpoint if any advancing is done. In the event of a
> >> crash, the slot may return to an earlier position.
> > 
> > I just searched through the *.sgml files; we already use the terms 'dirty' and
> > 'flush' for writing out pages during checkpoints. Here we are trying
> > to describe a very similar process, but in relation to replication slots,
> > so it looks fine to me. At the same time, the term 'schedule' is used for
> > VACUUM, constraint checks and the checkpoint itself.
> 
> Honestly, I was a bit on the fence about the term "dirty" when typing
> this paragraph, so I kind of agree with Horiguchi-san's point that it
> could be confusing when applied to replication slots, because there is
> no other reference in the docs about the link between the two
> concepts.  So, I would go for a more simplified sentence for the first
> part, keeping the second sentence intact:
> "The information of the updated slot is written out at the follow-up
> checkpoint if any advancing is done.  In the event of a crash, the
> slot may return to an earlier position."

Looks perfect.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Proposal] Global temporary tables

2020-01-29 Thread Konstantin Knizhnik



On 27.01.2020 22:44, Pavel Stehule wrote:


I don't think compatibility with Oracle is a valid point in this 
case. We need GTT, but the mechanism of index building should be 
designed for Postgres, and for its users.


Maybe the method you propose can be activated by some option like 
CREATE INDEX IMMEDIATELY FOR ALL SESSION. When you use a GTT without an 
index, it should keep working for some time, and if you use short-lived 
sessions, then the index build can be the last or almost the last 
operation on the table and can be suboptimal.


Anyway, this behavior can be changed later without major complications. 
For now I have a strong preference for not allowing any DDL (including 
index creation) on a GTT that is active in other sessions.
Your proposal (building indexes in other sessions when the GTT is 
touched) could probably share code with an approach that just modifies 
the metadata and waits for a session reset or GTT reset.


Well, compatibility with Oracle was never treated as an important 
argument in this group :)

But I hope you agree that it is a real argument against your proposal.
A much more important argument is the incompatibility with the behavior 
of regular tables.
If you propose such an incompatibility, then you should have some very 
strong arguments for behavior which will definitely confuse users.


But I have heard only two arguments:

1. Concurrent building of indexes by all backends may consume a lot of 
memory (n_backends * maintenance_work_mem) and a lot of disk/CPU 
resources.


First of all, this is not completely true. Indexes will be created on 
demand when the GTT is accessed, and the chance that all sessions will 
start building indexes simultaneously is very small.


But what will happen if we prohibit access to this index for existing 
sessions? If we need an index for a GTT, then most likely it is used for joins.
If there is no index, then the optimizer has to choose some other plan to 
perform the join, for example a hash join. A hash join also requires 
memory,
so if all backends perform such a join simultaneously, they will 
consume (n_backends * work_mem) memory.
Yes, work_mem is usually smaller than maintenance_work_mem. But in 
any case the DBA has the option to adjust these parameters to avoid the problem.
And in the case of your proposal (prohibiting access to the index) you give 
the DBA no way to optimize query execution in existing sessions.


Also, if all sessions simultaneously perform a sequential scan of the GTT 
instead of building an index for it, they will read the same amount of 
data and consume comparable CPU time.
So prohibiting access to the indexes will not save us from high 
resource consumption if all existing sessions are really actively 
working with this GTT.


2. A GTT in one session can contain a large amount of data, for which we 
need an index, while it holds a small amount of data in another session, 
for which we do not need an index.


Such a situation can definitely happen. But it contradicts the main 
assumption of the GTT use case (that it is accessed in the same way by all 
sessions).
Also, I might agree with this argument if you proposed creating indexes 
locally for each session.
But your proposal is to prohibit access to the index for the sessions 
which have already populated the GTT with data, while allowing it for 
sessions which have not accessed this GTT yet.
So if some session stores data in the GTT after the index was created, it 
will build the index for it, no matter whether the table is 
small or large.
Why do we make an exception for sessions which already have data in the GTT 
in this case?


So from my point of view both arguments are doubtful and cannot explain 
why the rules of index usability for GTT should be different from those 
for regular tables.


Usually it is not a hard problem to refresh sessions, and as far as I 
know, when you update plpgsql code it is best practice to refresh the 
session early.




I know many systems where a session is established once the client connects 
to the system and is not closed until the client disconnects.
Any attempt to force termination of the session will cause 
application errors which the client does not expect.



Sorry, I think this is a principal point in the discussion concerning GTT 
design.
The implementation of GTT can be changed in the future, but it is bad if 
the behavior of GTT changes.
It is not clear to me why, from the very beginning, we should provide 
inconsistent behavior which is even more difficult to implement than 
behavior compatible with regular tables.

And say that in the future it can be changed...

Sorry, but I do not consider proposals to create indexes locally for 
each session (i.e. global tables but private indexes), or to use some 
special complicated SQL syntax construction like
CREATE INDEX IMMEDIATELY FOR ALL SESSION, to be real alternatives which 
have to be discussed.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Option to dump foreign data in pg_dump

2020-01-29 Thread Luis Carril
Thanks for working on the comments. I noticed one behavior is
different when --table option is specified. When --table is specified
the following are not getting dumped:
CREATE SERVER foreign_server

I felt the above also should be included as part of the dump when
include-foreign-data option is specified.

Yes, it also happens on master. A dump of a foreign table using --table, which 
only dumps the table definition, does not include the extension or the server.
I guess the idea behind --table is that the table's prerequisites should 
already exist in the database.

A similar behavior can be reproduced for a non-foreign table. If a table is 
created in a specific schema, dumping only the table with --table does not dump 
the schema definition.

So I think we do not need to dump the server with the table.

Cheers

Luis M Carril



Re: Append with naive multiplexing of FDWs

2020-01-29 Thread Kyotaro Horiguchi
Thanks!

At Wed, 29 Jan 2020 14:41:07 +0800, Movead Li  wrote in 
> >"Parallel scan" at the moment means multiple workers fetch unique 
> >blocks from *one* table in an arbitrated manner. In this sense 
> >"parallel FDW scan" means multiple local workers fetch unique bundles 
> >of tuples from *one* foreign table, which means it is running on a 
> >single session.  That doesn't offer an advantage. 
> 
> It may not be "parallel FDW scan"; it can be "parallel shards scan",
> where the local workers pick every foreign partition to scan. I have
> drawn a picture of that; you can see it at the link below.
> 
> https://www.highgo.ca/2019/08/22/parallel-foreign-scan-of-postgresql/
> 
> I think "parallel shards scan" makes sense in this way.

It is "asynchronous append on async-capable'd postgres-fdw scans". It
could be called as such in the sense that it is intended to be used
with sharding.

> >If parallel query processing worked in worker-per-table mode, 
> >especially on partitioned tables, maybe the current FDW would work 
> >without much modification. But I believe asynchronous append on 
> >foreign tables in a single process is far more resource-efficient and 
> >moderately faster than parallel append. 
> 
> As the test results show, the current patch cannot gain more performance when 
> it returns a huge number of tuples. With the "parallel shards scan" method
> it can work well, because 'parallel' can make full use of the CPUs while 
> 'asynchronous' can't. 

Did you look at my benchmarking results upthread?  They show a
significant gain even when gathering a large number of tuples from
multiple servers, or even from a single server.  That is because of its
asynchronous nature.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2020-01-29 Thread Masahiko Sawada
On Sun, 26 Jan 2020 at 01:35, Sehrope Sarkuni  wrote:
>
> Hi,
>
> I took a look at this patch. With some additions I think the feature
> itself is useful but the patch needs more work. It also doesn't have
> any of its own automated tests yet so the testing below was done
> manually.
>
> The attached file, kms_v2.patch, is a rebased version of the
> kms_v1.patch that fixes some bit rot. It sorts some of the Makefile
> additions but otherwise is the original patch. This version applies
> cleanly on master and passes make check.


Thank you for the comments and updating the patch.

>
> I don't have a Windows machine to test it, but I think the Windows
> build files for these changes are missing. The updated
> src/common/Makefile has a comment to coordinate updates to
> Mkvcbuild.pm but I don't see kmgr_utils.c or cipher_openssl.c
> referenced anywhere in there.

Will support Windows building.

>
> The patch adds "pg_kmgr" to the list of files to skip in
> pg_checksums.c but there's no additional "pg_kmgr" file written to the
> data directory. Perhaps that's from a prior version that saved data to
> its own file?

Right, it's unnecessary. Will remove it.

>
> The constant AES128_KEY_LEN is defined in cipher.c but it's not used
> anywhere. RE: AES-128, not sure the value of even supporting it for
> this feature (v.s. just supporting AES-256). Unlike something like
> table data encryption, I'd expect a KMS to be used much less
> frequently so any performance boost of AES-128 vs AES-256 would be
> meaningless.

Ok. I agree to support only AES256 for this feature.

> The functions pg_cipher_encrypt(...), pg_cipher_decrypt(...), and
> pg_compute_HMAC(...) return true if OpenSSL is not configured. Should
> that be false? The ctx init functions all return false when not
> configured. I don't think that code path would ever be reached as you
> would not have a valid context but seems more consistent to have them
> all return false.

Agreed.

>
> There's a comment referring to "Encryption keys (TDEK and WDEK)
> length" but this feature is only for a KMS so that should be renamed.
>

Will remove it.

> The passphrase is hashed to split it into two 32-byte keys but the min
> length is only 8-bytes:
>
> #define KMGR_MIN_PASSPHRASE_LEN 8
>
> ... that should be at least 64-bytes to reflect how it's being used
> downstream. Depending on the format of the passphrase commands output
> it should be even longer (ex: binary data in hex should really be
> double that). The overall min should be 64-byte but maybe add a note
> to the docs to explain how the output will be used and the expected
> amount of entropy.

Agreed.

>
> In pg_kmgr_wrap(...) it checks that the input is a multiple of 8 bytes:
>
> if (datalen % 8 != 0)
> ereport(ERROR,
> (errmsg("input data must be multiple of 8 bytes")));
>
> ...but after testing it, the OpenSSL key wrap functions it invokes
> require a multiple of 16-bytes (block size of AES). Otherwise you get
> a generic error:
>
> # SELECT pg_kmgr_wrap('abcd1234'::bytea);
> ERROR:  could not wrap the given secret

Thank you for testing it. I will follow your suggestion described below.
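
For reference, the minimal adjustment implied by the quoted observation would be
to test against the AES block size instead; whether the final patch does exactly
this or relaxes the requirement in some other way is a separate question, so the
snippet below is only an illustrative sketch in the same style as the quoted code:

if (datalen % 16 != 0)      /* AES block size, per the observation above */
    ereport(ERROR,
            (errmsg("input data must be multiple of 16 bytes")));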

>
> In ossl_compute_HMAC(...) it refers to AES256_KEY_LEN. Should be
> SHA256_HMAC_KEY_LEN (they're both 32-bytes but naming is wrong)
>
> return HMAC(EVP_sha256(), key, AES256_KEY_LEN, data,
> (uint32) data_size, result, (uint32 *) result_size);

Will fix.

>
> In pg_rotate_encryption_key(...) the error message for short
> passphrases should be "at least %d bytes":
>
> if (passlen < KMGR_MIN_PASSPHRASE_LEN)
> ereport(ERROR,
> (errmsg("passphrase must be more than %d bytes",
> KMGR_MIN_PASSPHRASE_LEN)));

Agreed. Will fix.

>
> Rotating the passphrase via "SELECT pg_rotate_encryption_key()" and
> restarting the server worked (good). Having the server attempt to
> start with invalid output from the command gives an error "FATAL:
> cluster passphrase does not match expected passphrase" (good).
>
> Round tripping via wrap/unwrap works (good!):
>
> # SELECT convert_from(pg_kmgr_unwrap(pg_kmgr_wrap('abcd1234abcd1234'::bytea)),
> 'utf8');
>convert_from
> --
>  abcd1234abcd1234
> (1 row)
>
> Trying to unwrap gibberish fails (also good!):
>
> # SELECT pg_kmgr_unwrap('\x123456789012345678901234567890123456789012345678');
> ERROR:  could not unwrap the given secret
>

Thank you for testing!

> The pg_kmgr_wrap/unwrap functions use EVP_aes_256_wrap()[1] which
> implements RFC 5649[2] with the default IVs so they always return the
> same value for the same input:
>
> # SELECT x, pg_kmgr_wrap('abcd1234abcd1234abcd1234') FROM
> generate_series(1,5) x;
>  x |pg_kmgr_wrap
> ---+
>  1 | \x51041d1fe52916fd15f456c2b67108473d9bf536795e2b6d4db81c065c8cd688
>  2 | \x51041d1fe52916fd15f456c2b67108473d9bf536795e2b6d4db81c065c8cd688
>  3 | \x51041d1fe52916fd1

psqlODBC development

2020-01-29 Thread Robert Willis
Hello,

I am interested in contributing source code changes for psqlODBC.
These changes cover (trivial) omissions, corrections, and potential 
enhancements.

How do I go about this? Is there a specific mailing list (other than 
this one) for that purpose?
Or is there some other mechanism that is employed? Who/what is the 
final arbiter for actual
code changes that get distributed in releases?

If this information is in a "readme" somewhere that I have overlooked, 
please let me know.

Robert

  

Re: Autovacuum on partitioned table

2020-01-29 Thread Amit Langote
On Wed, Jan 29, 2020 at 11:29 AM yuzuko  wrote:
> > Besides the complexity of
> > getting that infrastructure in place, an important question is whether
> > the current system of applying threshold and scale factor to
> > changes_since_analyze should be used as-is for inheritance parents
> > (partitioned tables), because if users set those parameters similarly
> > to for regular tables, autovacuum might analyze partitioned tables
> > more than necessary.  We'll either need a different formula, or some
> > commentary in the documentation about how partitioned tables might
> > need different settings, or maybe both.
> >
> I'm not sure but I think we need new autovacuum parameters for
> partitioned tables (autovacuum, autovacuum_analyze_threshold,
> autovacuum_analyze_scale_factor) because whether it's necessary
> to run autovacuum on partitioned tables will depend on users.
> What do you think?

Yes, we will need to first support those parameters on partitioned
tables.  Currently, you get:

create table p (a int) partition by list (a) with
(autovacuum_analyze_scale_factor=0);
ERROR:  unrecognized parameter "autovacuum_analyze_scale_factor"

> > How are you going to track changes_since_analyze of a partitioned table?
> > It's just an idea, but we can accumulate changes_since_analyze of the
> > partitioned table by adding each child table's value after analyzing that
> > child table, and compare the partitioned table's value to the threshold
> > that is computed as (autovacuum_analyze_threshold + total rows
> > including all child tables * autovacuum_analyze_scale_factor).
> >
> The idea Sawada-san mentioned is similar to mine.

So if I understand this idea correctly, a partitioned table's analyze
will only be triggered when partitions are analyzed.  That is,
inserts, updates, and deletes of tuples in partitions will be tracked by
pgstat, which in turn is used by autovacuum to trigger analyze on
partitions.  Then, the partitions' changes_since_analyze is added into the
parent's changes_since_analyze, which in turn *may* trigger an analyze of
the parent.  I said "may", because it would take multiple partition
analyzes to accumulate enough changes to trigger one on the parent.
Am I getting that right?
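
To make the quoted formula concrete, the check would presumably end up looking
something like the per-table test in relation_needs_vacanalyze(); this is only an
illustrative sketch, and all of the variable names here are made up rather than
taken from any posted patch:

float4      anlthresh;
bool        doanalyze;

/* threshold computed over the total row count of all children */
anlthresh = (float4) analyze_threshold +
    analyze_scale_factor * total_rows_in_all_children;

/* changes accumulated from the partitions' changes_since_analyze */
doanalyze = (parent_changes_since_analyze > anlthresh);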

>  Also, for tracking
> changes_since_analyze, we have to keep statistics for the partitioned table.
> To do that, we can invent a new PgStat_StatPartitionedTabEntry based
> on PgStat_StatTabEntry.  Through talking with Amit, I think the new structure
> needs the following members:
>
> tableid
> changes_since_analyze
> analyze_timestamp
> analyze_count
> autovac_analyze_timestamp
> autovac_analyze_count
>
> Vacuum doesn't run on partitioned tables, so I think members related to
> (auto) vacuum need not be contained in the structure.

On second thought, maybe we don't need a new PgStat_ struct.  We can
just use what's used for regular tables and leave the fields that
don't make sense for partitioned tables set to 0, such as those that
track the counts of scans, tuples, etc.  That means we don't have to
mess with interfaces of existing functions, like this one:

static void relation_needs_vacanalyze(Oid relid,
  AutoVacOpts *relopts,
  Form_pg_class classForm,
  PgStat_StatTabEntry *tabentry, ...

Thanks,
Amit




Re: Making psql error out on output failures

2020-01-29 Thread Daniel Verite
David Zhang wrote:

> > Are you sure? I don't find that redefinition. Besides
> > print_aligned_text() also calls putc and puts.
> Yes, below is the gdb debug message when psql first time detects the 
> error "No space left on device". Test case, "postgres=# select 
> repeat('111', 100) \g /mnt/ramdisk/file"
> bt
> #0  flushbuffer (target=0x7ffd6a709ad0) at snprintf.c:313

Indeed. For some reason gdb won't let me step into these fprintf()
calls, but you're right they're redefined (through include/port.h):

#define vsnprintf   pg_vsnprintf
#define snprintf    pg_snprintf
#define vsprintf    pg_vsprintf
#define sprintf     pg_sprintf
#define vfprintf    pg_vfprintf
#define fprintf     pg_fprintf
#define vprintf     pg_vprintf
#define printf(...) pg_printf(__VA_ARGS__)

Anyway, I don't see it leading to an actionable way to reliably keep
errno, as discussed upthread.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite




Re: Append with naive multiplexing of FDWs

2020-01-29 Thread movead...@highgo.ca
Hello,

>It is "asynchronous append on async-capable'd postgres-fdw scans". It
>could be called as such in the sense that it is intended to be used 
>with sharding.
 Yes that's it.

 
>Did you looked at my benchmarking result upthread?  Even it gives
>significant gain even when gathering large number of tuples from
>multiple servers or even from a single server.  It is because of its
>asynchronous nature.
I mean that it gains performance at first, but it hits a bottleneck as
the number of nodes increases.
For example:
   With 2 nodes, it gains 200% performance.
   With 3 nodes, it gains 300% performance.
   However,
   With 4 nodes, it still gains 300% performance.
   With 5 nodes, it still gains 300% performance.
   ...

 

Highgo Software (Canada/China/Pakistan) 
URL : www.highgo.ca 
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca


Re: psqlODBC development

2020-01-29 Thread Christoph Moench-Tegeder
## Robert Willis (rwil...@abinitio.com):

> How do I go about this?Is there a specific-mailing list (other than 
> this one) for that purpose?

https://odbc.postgresql.org/
"psqlODBC is developed and supported through the pgsql-o...@postgresql.org
mailing list."

Regards,
Christoph

-- 
Spare Space




Re: closesocket behavior in different platforms

2020-01-29 Thread vignesh C
On Tue, Jan 21, 2020 at 11:22 AM Amit Kapila  wrote:
>
> On Fri, Dec 6, 2019 at 11:24 AM vignesh C  wrote:
> >
> > It is noticed that in all the 4 cases the message "FATAL:  terminating 
> > connection due to administrator command" does not appear in windows.
> >
> > However the following message is present in the server log file:
> > FATAL:  terminating connection due to administrator command
> >
> > The reason for this looks like:
> > When the server closes a connection, it sends the ErrorResponse packet, and 
> > then closes the socket and terminates the backend process. If the packet is 
> > received before the server closes the connection, the error message is 
> > received in both windows and linux. If the packet is not received before 
> > the server closes the connection, the error message is not received in case 
> > of windows whereas in linux it is received.
> >
> > There have been a couple of discussions on this earlier [1] & [2], but 
> > we could not find any alternate solution.
> >
> > One of the options that msdn suggests in [3] is to use SO_LINGER option, we 
> > had tried this option with no luck in solving. One other thing that we had 
> > tried was to sleep for 1 second before closing the socket, this solution 
> > works if the client is active, whereas in case of inactive clients it does 
> > not solve the problem. One other thought that we had was to simultaneously 
> > check the connection from psql, when we are waiting for query input in 
> > gets_interactive function or have a separate thread to check the connection 
> > status periodically, this might work only in case of psql but will not work 
> > for application which uses libpq. Amit had also suggested one solution in 
> > [4], where he proposed 'I have also tried calling closesocket() explicitly 
> > in our function socket_close which has changed the error message to "could 
> > not receive data from server: Software caused connection abort 
> > (0x2745/10053)".'
> >
>
> Based on previous investigation and information in this email, I don't
> see anything we can do about this.
>
> > Should we add some documentation for the above behavior.
> >
>
> That sounds reasonable to me.  Any proposal for the same?   One idea
> could be to add something like "Client Disconnection Problems" after
> the "Client Connection Problems" section in docs [1].
>

Thanks for your review and suggestion. I have made a patch based on
similar lines. Attached patch has the doc update with the explanation.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
From e8fc0f70c4bb0bdcff8edc748ae7e04a0fcbf7b4 Mon Sep 17 00:00:00 2001
From: vignesh 
Date: Wed, 29 Jan 2020 16:23:51 +0530
Subject: [PATCH] Doc update for the server message not being displayed when
 the client connection is closed by the server.

Doc update for the server message not being displayed when the client
connection is closed by the server in case of
idle_in_transaction_session_timeout, pg_terminate_backend, pg_ctl kill TERM
and drop database with (force). When the server closes the client connection, it
sends the ErrorResponse packet, and then closes the socket and terminates the
backend process. If the packet is received before the server closes the
connection, the error message is received in both windows and linux. If the
packet is not received before the server closes the connection, the error
message is not received in the case of windows whereas in non-windows it is
received.
---
 doc/src/sgml/runtime.sgml | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index a34d31d..3a0abd7 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -627,6 +627,36 @@ psql: could not connect to server: No such file or directory
  indicate more fundamental problems, like lack of network
  connectivity.
 
+
+
+ You will get server closed the connection unexpectedly message while
+ trying to execute sql command on disconnected connection. The message is
+ slightly different in windows and non-windows. In non-windows, you will
+ see a FATAL message before the error message:
+
+FATAL:  terminating connection due to idle-in-transaction timeout
+server closed the connection unexpectedly
+This probably means the server terminated abnormally
+before or while processing the request.
+The connection to the server was lost. Attempting reset: Succeeded.
+
+ In windows, you might not see the FATAL message:
+
+server closed the connection unexpectedly
+This probably means the server terminated abnormally
+before or while processing the request.
+The connection to the server was lost. Attempting reset: Succeeded.
+
+ This message "FATAL:  terminating connection due to idle-in-transaction
+ timeout" that is sent from server will not be displayed in windows,
+ however it will be present in the log file. The reason 

Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-01-29 Thread Kasahara Tatsuhito
Hi.

The attached patch solves this problem.

This patch adds table_beginscan_tid() and calls it in TidListEval()
instead of table_beginscan().
table_beginscan_tid() is the same as table_beginscan() but does not set
SO_TYPE_SEQSCAN in the flags.
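
Since the patch is attached rather than inlined, here is a rough sketch of what
such a helper might look like, modeled on the existing table_beginscan() in
tableam.h with the SO_TYPE_SEQSCAN bit dropped (the actual patch may differ):

static inline TableScanDesc
table_beginscan_tid(Relation rel, Snapshot snapshot,
                    int nkeys, struct ScanKeyData *key)
{
    /* same as table_beginscan(), but without SO_TYPE_SEQSCAN */
    uint32      flags = SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE;

    return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, NULL, flags);
}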

Although I'm not sure whether this behavior is really a problem or not,
it seems to me that the previous behavior is preferable.

Is it worth applying to HEAD and the v12 branch?

Best regards,

On Mon, 27 Jan 2020 at 14:35, Kasahara Tatsuhito wrote:
>
> Hi, I noticed that since PostgreSQL 12, Tid scan increments value of
> pg_stat_all_tables.seq_scan. (but not seq_tup_read)
>
> The following is an example.
>
> CREATE TABLE t (c int);
> INSERT INTO t SELECT 1;
> SET enable_seqscan to off;
>
> (v12 -)
> =# EXPLAIN ANALYZE SELECT * FROM t WHERE ctid = '(0,1)';
> QUERY PLAN
> ---
>  Tid Scan on t  (cost=0.00..4.01 rows=1 width=4) (actual
> time=0.034..0.035 rows=1 loops=1)
>TID Cond: (ctid = '(0,1)'::tid)
>  Planning Time: 0.341 ms
>  Execution Time: 0.059 ms
> (4 rows)
>
> =# SELECT seq_scan, seq_tup_read FROM pg_stat_user_tables WHERE relname = 't';
>  seq_scan | seq_tup_read
> --+--
> 1 |0
> (1 row)
>
> (- v11)
> =# EXPLAIN ANALYZE SELECT * FROM t WHERE ctid = '(0,1)';
> QUERY PLAN
> ---
>  Tid Scan on t  (cost=0.00..4.01 rows=1 width=4) (actual
> time=0.026..0.027 rows=1 loops=1)
>TID Cond: (ctid = '(0,1)'::tid)
>  Planning Time: 1.003 ms
>  Execution Time: 0.068 ms
> (4 rows)
>
> postgres=# SELECT seq_scan, seq_tup_read FROM pg_stat_user_tables
> WHERE relname = 't';
>  seq_scan | seq_tup_read
> --+--
> 0 |0
> (1 row)
>
>
> Exactly, this change occurred from following commit.
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=147e3722f7e531f15ba389a4d518efe8cd0bd736)
> I think, from this commit, TidListEval() came to call
> table_beginscan(), so this increment started to happen.
>
> I'm not sure whether this change was intentional or not, but it can confuse some users.
>
> Best regards,
> --
> NTT Open Source Software Center
> Tatsuhito Kasahara



-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com


fix_tidscan_increments_seqscan_num.patch
Description: Binary data


Re: BUG #16109: Postgres planning time is high across version (Expose buffer usage during planning in EXPLAIN)

2020-01-29 Thread Julien Rouhaud
On Fri, Jan 24, 2020 at 10:06 PM Julien Rouhaud  wrote:
>
> On Fri, Jan 24, 2020 at 6:55 AM Justin Pryzby  wrote:
> >
> > On Wed, Nov 13, 2019 at 11:39:04AM +0100, Julien Rouhaud wrote:
> > > (moved to -hackers)
> > >
> > > On Tue, Nov 12, 2019 at 9:55 PM Andres Freund  wrote:
> > > >
> > > > This last point is more oriented towards other PG developers: I wonder
> > > > if we ought to display buffer statistics for plan time, for EXPLAIN
> > > > (BUFFERS). That'd surely make it easier to discern cases where we
> > > > e.g. access the index and scan a lot of the index from cases where we
> > > > hit some CPU time issue. We should easily be able to get that data, I
> > > > think, we already maintain it, we'd just need to compute the diff
> > > > between pgBufferUsage before / after planning.
> > >
> > > That would be quite interesting to have.  I attach as a reference a
> > > quick POC patch to implement it:
> >
> > +1
> >
> > +   result.shared_blks_hit = stop->shared_blks_hit - 
> > start->shared_blks_hit;
> > +   result.shared_blks_read = stop->shared_blks_read - 
> > start->shared_blks_read;
> > +   result.shared_blks_dirtied = stop->shared_blks_dirtied -
> > +   start->shared_blks_dirtied;
> > [...]
> >
> > I think it would be more readable and maintainable using a macro:
> >
> > #define CALC_BUFF_DIFF(x) result.##x = stop->##x - start->##x
> > CALC_BUFF_DIFF(shared_blks_hit);
> > CALC_BUFF_DIFF(shared_blks_read);
> > CALC_BUFF_DIFF(shared_blks_dirtied);
> > ...
> > #undefine CALC_BUFF_DIFF
>
> Good idea.  Note that you can't use preprocessor concatenation to
> generate anything other than a token or a number, so the ## can just
> be removed here.
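
For anyone following along, with the ## removed the suggested helper would look
roughly like this (the field names are taken from the quoted snippet; the exact
form in the patch may differ):

#define CALC_BUFF_DIFF(x) (result.x = stop->x - start->x)

    CALC_BUFF_DIFF(shared_blks_hit);
    CALC_BUFF_DIFF(shared_blks_read);
    CALC_BUFF_DIFF(shared_blks_dirtied);

#undef CALC_BUFF_DIFF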

Rebase due to conflict with 3ec20c7091e97.


show_planning_buffers-v3.diff
Description: Binary data


Re: Autovacuum on partitioned table

2020-01-29 Thread Michael Paquier
On Wed, Jan 29, 2020 at 05:56:40PM +0900, Amit Langote wrote:
> Yes, we will need to first support those parameters on partitioned
> tables.  Currently, you get:
> 
> create table p (a int) partition by list (a) with
> (autovacuum_analyze_scale_factor=0);
> ERROR:  unrecognized parameter "autovacuum_analyze_scale_factor"

Worth noting: partitioned tables support zero reloptions as of now,
but there is the facility in place to allow that (see
RELOPT_KIND_PARTITIONED and partitioned_table_reloptions).
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2020-01-29 Thread Etsuro Fujita
Hi Mark,

On Tue, Jan 28, 2020 at 1:39 PM Mark Dilger
 wrote:
> There is stray whitespace in v30-0002:
>
> src/backend/partitioning/partbounds.c:4557: space before tab in indent.
> +   outer_null_unmerged = true;

Good catch!

> I have added tests checking correctness and showing some partition pruning 
> limitations.  Find three patches, attached.
>
> The v31-0001-… patch merely applies your patches as a starting point for the 
> next two patches.  It is your work, not mine.
>
> The v31-0002-… patch adds more regression tests for range partitioning.  The 
> commit message contains my comments about that.
>
> The v31-0003-… patch adds more regression tests for list partitioning, and 
> again, the commit message contains my comments about that.

The PWJ behavior shown by the tests you added looks interesting!  I'll
dig into it more closely.  Thanks for the patches and review!

Best regards,
Etsuro Fujita




Re: [Proposal] Global temporary tables

2020-01-29 Thread 曾文旌(义从)


> On 29 Jan 2020, at 01:54, Pavel Stehule wrote:
> 
> 
> 
> On Tue, 28 Jan 2020 at 18:13, Pavel Stehule wrote:
> 
> 
> On Tue, 28 Jan 2020 at 18:12, 曾文旌(义从) wrote:
> 
> 
>> On 29 Jan 2020, at 00:40, Pavel Stehule wrote:
>> 
>> 
>> 
>> On Tue, 28 Jan 2020 at 17:01, 曾文旌(义从) wrote:
>> 
>> 
>>> On 24 Jan 2020, at 04:47, Robert Haas wrote:
>>> 
>>> On Sat, Jan 11, 2020 at 8:51 PM Tomas Vondra
>>> mailto:tomas.von...@2ndquadrant.com>> wrote:
 I proposed just ignoring those new indexes because it seems much simpler
 than alternative solutions that I can think of, and it's not like those
 other solutions don't have other issues.
>>> 
>>> +1.
>> I have completed the implementation of this feature.
>> When a session x creates an index idx_a on GTT A:
>> For session x, idx_a is valid as soon as CREATE INDEX finishes.
>> For session y, which had some data in GTT A before session x finished creating 
>> the index, idx_a is invalid.
>> For session z, which had no data in GTT A before session x finished creating 
>> the index, idx_a is valid.
>> 
>>> 
 For example, I've looked at the "on demand" building as implemented in
 global_private_temp-8.patch, I kinda doubt adding a bunch of index build
 calls into various places in index code seems somewhat suspicious.
>>> 
>>> +1. I can't imagine that's a safe or sane thing to do.
>>> 
>>> -- 
>>> Robert Haas
>>> EnterpriseDB: http://www.enterprisedb.com 
>>> The Enterprise PostgreSQL Company
>> 
>> Opinion by Pavel
>> +rel->rd_islocaltemp = true;  <<< if this is valid, then the name of 
>> field "rd_islocaltemp" is not probably best
>> I renamed rd_islocaltemp
>> 
>> I don't see any change?
> Rename rd_islocaltemp to rd_istemp  in global_temporary_table_v8-pg13.patch
> 
> ok :)
> 
> I found a bug
> 
> postgres=# create global temp table x(a int);
> CREATE TABLE
> postgres=# insert into x values(1);
> INSERT 0 1
> postgres=# create index on x (a);
> CREATE INDEX
> postgres=# create index on x((a + 1));
> CREATE INDEX
> postgres=# analyze x;
> WARNING:  oid 16468 not a relation
> ANALYZE
Thanks for the review.

The index expression needs to store statistics on the index; I missed that and I'll 
fix it later.


Wenjing

> 
> other behave looks well for me.
> 
> Regards
> 
> Pavel
> 
> 
> Pavel
> 
> 
> Wenjing
> 
> 
> 
>> 
>> 
>> 
>> Opinion by Konstantin Knizhnik
>> 1 Fixed comments
>> 2 Fixed assertion
>> 
>> 
>> Please help me review.
>> 
>> 
>> Wenjing
>> 
> 



Re: [Proposal] Global temporary tables

2020-01-29 Thread Robert Haas
On Mon, Jan 27, 2020 at 4:11 AM Konstantin Knizhnik
 wrote:
> I do not see any reason to allow building local indexes for a global table. 
> Yes, it can happen that some session will have a small amount of data in a 
> particular GTT and another a small amount of data in this table. But if the 
> access pattern is the same (and the nature of GTT assumes it), then the index is 
> either useful or useless in both cases.

I agree. I think allowing different backends to have different indexes
is overly complicated.

Regarding another point that was raised, I think it's not a good idea
to prohibit DDL on global temporary tables altogether. It should be
fine to change things when no sessions are using the GTT. Going
further and allowing changes when there are attached sessions would be
nice, but I think we shouldn't try. Getting this feature committed is
going to be a huge amount of work with even a minimal feature set;
complicating the problem by adding what are essentially new
DDL-concurrency features on top of the basic feature seems very much
unwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: standby apply lag on inactive servers

2020-01-29 Thread Alvaro Herrera
On 2020-Jan-29, Kyotaro Horiguchi wrote:

> But as a more significant issue, nowadays PostgreSQL doesn't run a
> checkpoint if it is really inactive (that is, if no "important" WAL
> records have been issued).

Yeah, I mentioned this in message
<20200127203419.GA15216@alvherre.pgsql>.  The solution for monitoring
purposes is to use the new "lag" columns in pg_stat_replication.  But
that's not available in older releases (prior to 10), so this change is
still useful.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Proposal] Global temporary tables

2020-01-29 Thread Robert Haas
On Tue, Jan 28, 2020 at 12:12 PM 曾文旌(义从)  wrote:
>> Opinion by Pavel
>> + rel->rd_islocaltemp = true;  <<< if this is valid, then the name of 
>> field "rd_islocaltemp" is not probably best
>> I renamed rd_islocaltemp
>
> I don't see any change?
>
> Rename rd_islocaltemp to rd_istemp  in global_temporary_table_v8-pg13.patch

In view of commit 6919b7e3294702adc39effd16634b2715d04f012, I think
that this has approximately a 0% chance of being acceptable. If you're
setting a field in a way that is inconsistent with the current use of
the field, you're probably doing it wrong, because the field has an
existing purpose to which new code must conform. And if you're not
doing that, then you don't need to rename it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Block level parallel vacuum

2020-01-29 Thread Masahiko Sawada
On Tue, 28 Jan 2020 at 18:47, Amit Kapila  wrote:
>
> On Tue, Jan 28, 2020 at 12:53 PM Mahendra Singh Thalor
>  wrote:
> >
> > > > > > 1.
> > > > > > -P, --parallel=PARALLEL_DEGREE  do parallel vacuum
> > > > > >
> > > > > > I think, "do parallel vacuum" should be modified. Without 
> > > > > > specifying -P, we are still doing parallel vacuum so we can use 
> > > > > > like "degree for parallel vacuum"
> > > > > >
> > > > >
> > > > > I am not sure if 'degree' makes it very clear.  How about "use this
> > > > > many background workers for vacuum, if available"?
> > > >
> > > > If background workers are many, then automatically, we are using 
> > > > them(by default parallel vacuum). This option is to put limit on 
> > > > background workers(limit for vacuum workers) to be used by vacuum 
> > > > process.
> > > >
> > >
> > > I don't think that the option is just to specify the max limit because
> > > that is generally controlled by guc parameters.  This option allows
> > > users to specify the number of workers for the cases where he has more
> > > knowledge about the size/type of indexes.  In some cases, the user
> > > might be able to make a better decision and that was the reason we
> > > have added this option in the first place.
> > >
> > > > So I think, we can use "max parallel vacuum workers (by default, based 
> > > > on no. of indexes)" or "control parallel vacuum workers"
> > > >
> > >
> > > Hmm, I feel what I suggested is better because of the above explanation.
> >
> > Agreed.
> >
>
> Okay, thanks for the review.  Attached is an updated patch. I have
> additionally run pgindent.  I am planning to commit the attached
> tomorrow unless I see more comments.

Thank you for committing it!

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pg_stat_progress_basebackup - progress reporting for pg_basebackup, in the server side

2020-01-29 Thread Fujii Masao

Hi,

pg_basebackup reports the backup progress if the --progress option is specified,
and we can monitor it on the client side. I think it would be useful if we could
also monitor the progress information on the server side because, for example,
we could then easily check it using SQL. So I'd like to propose a
pg_stat_progress_basebackup view that allows us to monitor the progress
of pg_basebackup on the server side. Thoughts?

POC patch is attached.
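
For orientation, the server side of such a view would presumably be fed through
the existing command-progress machinery that already backs views like
pg_stat_progress_create_index; the calls below exist today, but the
PROGRESS_COMMAND_BASEBACKUP command type and the parameter indexes are
hypothetical names used only for illustration, not taken from the attached patch:

/* in the WAL sender, when BASE_BACKUP starts */
pgstat_progress_start_command(PROGRESS_COMMAND_BASEBACKUP, InvalidOid);

/* updated as data is streamed */
pgstat_progress_update_param(PROGRESS_BASEBACKUP_BACKUP_TOTAL, backup_total);
pgstat_progress_update_param(PROGRESS_BASEBACKUP_BACKUP_STREAMED, backup_streamed);

/* when the backup ends */
pgstat_progress_end_command();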

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8839699079..136fcbc2af 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -376,6 +376,14 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   
11:34   0:00 postgres: ser
   
  
 
+ 
+  
pg_stat_progress_basebackuppg_stat_progress_basebackup
+  One row for each WAL sender process streaming a base backup,
+   showing current progress.
+   See .
+  
+ 
+
 

   
@@ -3515,7 +3523,10 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,
certain commands during command execution.  Currently, the only commands
which support progress reporting are ANALYZE,
CLUSTER,
-   CREATE INDEX, and VACUUM.
+   CREATE INDEX, VACUUM,
+   and  (i.e., replication
+   command that  issues to take
+   a base backup).
This may be expanded in the future.
   
 
@@ -4316,6 +4327,143 @@ SELECT pg_stat_get_backend_pid(s.backendid) AS pid,


   
+ 
+
+ 
+  Base Backup Progress Reporting
+
+  
+   Whenever pg_basebackup is taking a base
+   backup, the pg_stat_progress_basebackup
+   view will contain a row for each WAL sender process that is currently
+   running BASE_BACKUP replication command
+   and streaming the backup. The tables below describe the information
+   that will be reported and provide information about how to interpret it.
+  
+
+  
+   pg_stat_progress_basebackup View
+   
+
+
+  Column
+  Type
+  Description
+ 
+
+
+   
+
+ pid
+ integer
+ Process ID of a WAL sender process.
+
+
+ phase
+ text
+ Current processing phase. See .
+
+
+ backup_total
+ bigint
+ 
+  Total amount of data that will be streamed. If progress reporting
+  is not enabled in pg_basebackup
+  (i.e., --progress option is not specified),
+  this is 0. Otherwise, this is estimated and
+  reported as of the beginning of streaming backup
+  phase. Note that this is only an approximation since the database
+  may change during streaming backup phase
+  and WAL log may be included in the backup later. This is always
+  the same value as backup_streamed
+  once the amount of data already streamed exceeds the estimated
+  total size.
+ 
+
+
+ backup_streamed
+ bigint
+ 
+  Amount of data already streamed. This counter only advances
+  when the phase is streaming backup or
+  transfering wal.
+ 
+
+
+ tablespace_total
+ bigint
+ 
+  Total number of tablespaces that will be streamed.
+ 
+
+
+ tablespace_streamed
+ bigint
+ 
+  Number of tablespaces already streamed. This counter only
+  advances when the phase is streaming backup.
+ 
+
+   
+   
+  
+
+  
+   Base backup phases
+   
+
+ 
+  Phase
+  Description
+ 
+
+
+ 
+  initializing
+  
+   The WAL sender process is preparing to begin the backup.
+   This phase is expected to be very brief.
+  
+ 
+ 
+  starting backup
+  
+   The WAL sender process is currently performing
+   pg_start_backup and setting up for
+   making a base backup.
+  
+ 
+ 
+  streaming backup
+  
+   The WAL sender process is currently streaming a base backup.
+  
+ 
+ 
+  stopping backup
+  
+   The WAL sender process is currently performing
+   pg_stop_backup and finishing the backup.
+   If either --wal-method=none or
+   --wal-method=stream is specified in
+   pg_basebackup, the backup will end
+   when this phase is completed.
+  
+ 
+ 
+  transferring wal
+  
+   The WAL sender process is currently transferring all WAL logs
+   generated during the backup. This phase occurs after
+   stopping backup phase if
+   --wal-method=fetch is specified in
+   pg_basebackup. The backup will end
+   when this phase is completed.
+  
+ 
+
+   
+  
 
  
  
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 80275215e0..f139ba0231 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2465,7 +2465,7 @@ The commands accepted in replication mode are:
 
   
 
-  
+  
 BASE_BACKUP [ LABEL 
'label' ] [ PROGRESS ] [ 
FAST ] [ WAL ] [ 
NO

Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-01-29 Thread Fujii Masao




On 2020/01/29 20:06, Kasahara Tatsuhito wrote:

Hi.

The attached patch solves this problem.

This patch adds table_beginscan_tid() and calls it in TidListEval()
instead of table_beginscan().
table_beginscan_tid() is the same as table_beginscan() but does not set
SO_TYPE_SEQSCAN in the flags.

Although I'm not sure whether this behavior is really a problem or not,
it seems to me that the previous behavior is preferable.

Is it worth applying to HEAD and the v12 branch?


I've not read the patch yet, but I agree that updating only seq_scan
but not seq_tup_read in Tid Scan sounds strange. IMO at least
both should be updated together or neither should be updated.

Regards,

--
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters




Re: [Proposal] Global temporary tables

2020-01-29 Thread Robert Haas
On Wed, Jan 29, 2020 at 3:13 AM Konstantin Knizhnik
 wrote:
> But I have heard only two arguments:
>
> 1. Concurrent building of indexes by all backends may consume a lot of memory 
> (n_backends * maintenance_work_mem) and a lot of disk/CPU resources.
> 2. A GTT in one session can contain a large amount of data, for which we need an 
> index, but a small amount of data in another session, for which we do not need 
> an index.

You seem to be ignoring the fact that two committers told you this
probably wasn't safe.

Perhaps your view is that those people made no argument, and therefore
you don't have to respond to it. But the onus is not on somebody else
to tell you why a completely novel idea is not safe. The onus is on
you to analyze it in detail and prove that it is safe. What you need
to show is that there is no code anywhere in the system which will be
confused by an index springing into existence at whatever time you're
creating it.

One problem is that there are various backend-local data structures in
the relcache, the planner, and the executor that remember information
about indexes, and that may not respond well to having more indexes
show up unexpectedly. On the one hand, they might crash; on the other
hand, they might ignore the new index when they shouldn't. Another
problem is that the code which creates indexes might fail or misbehave
when run in an environment different from the one in which it
currently runs. I haven't really studied your code, so I don't know
exactly what it does, but for example it would be really bad to try to
build an index while holding a buffer lock, both because it might
cause (low-probability) undetected deadlocks and also because it might
block another process that wants that buffer lock in a
non-interruptible wait state for a long time.

Now, maybe you can make an argument that you only create indexes at
points in the query that are "safe." But I am skeptical, because of
this example:

rhaas=# create table foo (a int primary key, b text, c text, d text);
CREATE TABLE
rhaas=# create function blump() returns trigger as $$begin create
index on foo (b); return new; end$$ language plpgsql;
CREATE FUNCTION
rhaas=# create trigger thud before insert on foo execute function blump();
CREATE TRIGGER
rhaas=# insert into foo (a) select generate_series(1,10);
ERROR:  cannot CREATE INDEX "foo" because it is being used by active
queries in this session
CONTEXT:  SQL statement "create index on foo (b)"
PL/pgSQL function blump() line 1 at SQL statement

That prohibition is there for some reason. Someone did not just decide
to arbitrarily prohibit it. A CREATE INDEX command run in that context
won't run afoul of many of the things that might be problems in other
places -- e.g. there won't be a buffer lock held. Yet, despite the
fact that a trigger context is safe for executing a wide variety of
user-defined code, this particular operation is not allowed here. That
is the sort of thing that should worry you.

At any rate, even if this somehow were or could be made safe,
on-the-fly index creation is a feature that cannot and should not be
combined with a patch to implement global temporary tables. Surely, it
will require a lot of study and work to get the details right. And so
will GTT. As I said in the other email I wrote, this feature is hard
enough without adding this kind of thing to it. There's a reason why I
never got around to implementing this ten years ago when I did
unlogged tables; I was intending that to be a precursor to the GTT
work. I found that it was too hard and I gave up. I'm glad to see
people trying again, but the idea that we can afford to add in extra
features, or frankly that either of the dueling patches on this thread
are close to committable, is just plain wrong.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: closesocket behavior in different platforms

2020-01-29 Thread amul sul
On Wed, Jan 29, 2020 at 4:34 PM vignesh C  wrote:

> On Tue, Jan 21, 2020 at 11:22 AM Amit Kapila 
> wrote:
> >
> > On Fri, Dec 6, 2019 at 11:24 AM vignesh C  wrote:
> > >

[...]

>
> Thanks for your review and suggestion. I have made a patch based on
> similar lines. Attached patch has the doc update with the explanation.
> Thoughts?
>

Hi Vignesh,

I have looked into the patch, realised that some format tagging and
the grammar changes are needed. Commented inline below:

+
+
+ You will get server closed the connection unexpectedly message while

The error message is usually wrapped inside the  tag.

+ trying to execute sql command on disconnected connection. The message is

Also, s/disconnected connection/the disconnected connection

+ slightly different in windows and non-windows. In non-windows, you will

s/in windows/on Windows
s/In non-windows/On non-windows

+ see a FATAL message before the error message:

How about: On non-Windows you'll see a fatal error as below.

+
+FATAL:  terminating connection due to idle-in-transaction timeout
+server closed the connection unexpectedly
+This probably means the server terminated abnormally
+before or while processing the request.
+The connection to the server was lost. Attempting reset: Succeeded.
+
+ In windows, you might not see the FATAL message:

s/In windows /On Windows
s/FATAL message/fatal error

+
+server closed the connection unexpectedly
+This probably means the server terminated abnormally
+before or while processing the request.
+The connection to the server was lost. Attempting reset: Succeeded.
+
+ This message "FATAL:  terminating connection due to
idle-in-transaction

Usually  for doubt-quoting is used. Here, I think we should remove FATAL
and wrap the error message text inside  tag.

+ timeout" that is sent from server will not be displayed in windows,

How about : that is sent from the server will not be displayed on windows.

+ however it will be present in the log file. The reason for this is, in

s/however/However
s/in/on

+ windows the client cannot receive the message sent by the server when the

s/windows/Windows or Windows

+ server has closed the client connection. This behavior can be noticed
when
+ the client connection has been disconnected because of
+ idle_in_transaction_session_timeout, pg_terminate_backend, pg_ctl kill
+ TERM and drop database with (force).

s/idle_in_transaction_session_timeout/
s/pg_terminate_backend/pg_terminate_backend()
s/pg_ctl kill TERM/DROP DATABASE ... WITH ( FORCE )

Regards,
Amul


Re: Enabling B-Tree deduplication by default

2020-01-29 Thread Robert Haas
On Thu, Jan 16, 2020 at 3:05 PM Peter Geoghegan  wrote:
> The main reason that I am confident about unique indexes is that we
> only do a deduplication pass in a unique index when we observe that
> the incoming tuple (the one that might end up splitting the page) is a
> duplicate of some existing tuple. Checking that much is virtually
> free, since we already have the information close at hand today (we
> cache the _bt_check_unique() binary search bounds for reuse within
> _bt_findinsertloc() today). This seems to be an excellent heuristic,
> since we really only want to target unique index leaf pages where all
> or almost all insertions must be duplicates caused by non-HOT updates
> -- this category includes all the pgbench indexes, and includes all of
> the unique indexes in TPC-C. Whereas with non-unique indexes, we
> aren't specifically targeting version churn (though it will help with
> that too).

This (and the rest of the explanation) don't really address my
concern. I understand that deduplicating in lieu of splitting a page
in a unique index is highly likely to be a win. What I don't
understand is why it shouldn't just be a win, period. Not splitting a
page seems like it has a big upside regardless of whether the index is
unique -- and in fact, the upside could be a lot bigger for a
non-unique index. If the coarse-grained LP_DEAD thing is the problem,
then I can grasp that issue, but you don't seem very worried about
that.

Generally, I think it's a bad idea to give the user an "emergency off
switch" and then sometimes ignore it. If the feature seems to be
generally beneficial, but you're worried that there might be
regressions in obscure cases, then turn it on by default, and give the
user the ability to forcibly turn it off. But don't give them the
opportunity to forcibly turn it off and then honor it only
sometimes. Nobody's going to run
around setting a reloption just for fun -- they're going to do it
because they hit a problem.

I guess I'm also saying here that a reloption seems like a much better
idea than a GUC. I don't see much reason to believe that a system-wide
setting will be useful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: closesocket behavior in different platforms

2020-01-29 Thread Robert Haas
On Wed, Jan 29, 2020 at 6:04 AM vignesh C  wrote:
> Thanks for your review and suggestion. I have made a patch based on
> similar lines. Attached patch has the doc update with the explanation.
> Thoughts?

Documenting this doesn't seem very useful to me. If we could fix the
code, that would be useful, but otherwise I think I'd just do nothing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: BufFileRead() error signalling

2020-01-29 Thread Robert Haas
On Wed, Jan 29, 2020 at 1:26 AM Michael Paquier  wrote:
> On Tue, Jan 28, 2020 at 03:51:54PM -0500, Robert Haas wrote:
> > I quickly reread that thread and I don't see that there's any firm
> > consensus there in favor of "read %d of %zu" over "read only %d of %zu
> > bytes". Now, if most people prefer the former, so be it, but I don't
> > think that's clear from that thread.
>
> The argument of consistency falls in favor of the former on HEAD:
> $ git grep "could not read" | grep "read %d of %zu" | wc -l
> 59
> $ git grep "could not read" | grep "read only %d of %zu" | wc -l
> 0

True. I didn't realize that 'read %d of %zu' was so widely used.

Your grep misses one instance of 'read only %d of %d bytes' because
you grepped for %zu specifically, but that doesn't really change the
overall picture.
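
For context, the message style being converged on looks roughly like this at a
typical call site (a representative sketch, not a quote of any particular caller):

if (nread != nbytes)
    ereport(ERROR,
            (errcode_for_file_access(),
             errmsg("could not read file \"%s\": read %d of %zu",
                    path, nread, nbytes)));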

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pause recovery if pitr target not reached

2020-01-29 Thread Peter Eisentraut

On 2020-01-28 06:01, Kyotaro Horiguchi wrote:

Hello.

At Mon, 27 Jan 2020 12:16:02 +0100, Peter Eisentraut 
 wrote in

On 2020-01-15 05:02, Kyotaro Horiguchi wrote:

FWIW, I restate this (perhaps) more clearly.
At Wed, 15 Jan 2020 11:02:24 +0900 (JST), Kyotaro Horiguchi
 wrote in

recovery_target_* is not cleared after startup. If a server crashed
just after the last shutdown checkpoint, any recovery_target_* setting
prevents the server from starting regardless of its value.

recovery_target_* is not automatically cleared after a successful
archive recovery.  After that, if the server crashed just after the
last shutdown checkpoint, any recovery_target_* setting prevents the
server from starting regardless of its value.


Thank you for this clarification.  Here is a new patch that addresses
that and also the other comments raised about my previous patch.


The code looks fine, renaming reachedStopPoint to
reachedRecoveryTarget looks very nice. Doc part looks fine, too.


PostgresNode.pm
+Is has_restoring is used, standby mode is used by default.  To use

"Is has_restoring used,", or "If has_restoring is used," ?


Committed with that fix.


The change seems to be aiming not to break compatibility with external test
scripts, but it looks quite confusing to me.  The problem here is that both
enable_streaming/restoring independently put trigger files in place, so shouldn't
we separate the placing of trigger files out of those functions?


Yeah, this is all historically grown, but a major refactoring seems out 
of scope for this thread.  It seems hard to come up with a more elegant 
way, since after all the underlying mechanisms are also all intertwined. 
 Your patch adds even more code, so I'm not sure it's an improvement.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: making the backend's json parser work in frontend code

2020-01-29 Thread Robert Haas
On Tue, Jan 28, 2020 at 5:35 PM Mark Dilger
 wrote:
> I merged these a bit.  See v7-0001 for details.

I jiggered that a bit more and committed this. I couldn't see the
point of having both the FRONTEND and non-FRONTEND code include
pg_wchar.h.

I'll wait to see what you make of Andrew's latest comments before
doing anything further.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [Proposal] Global temporary tables

2020-01-29 Thread Konstantin Knizhnik




On 29.01.2020 17:47, Robert Haas wrote:

On Wed, Jan 29, 2020 at 3:13 AM Konstantin Knizhnik
 wrote:

But I heard only two arguments:

1. Concurrent building of indexes by all backends may consume much memory 
(n_backends * maintenance_work_mem) and consume a lot of disk/CPU resources.
2. GTT in one session can contains large amount of data and we need index for 
it, but small amount of data in another session and we do not need index for it.

You seem to be ignoring the fact that two committers told you this
probably wasn't safe.

Perhaps your view is that those people made no argument, and therefore
you don't have to respond to it. But the onus is not on somebody else
to tell you why a completely novel idea is not safe. The onus is on
you to analyze it in detail and prove that it is safe. What you need
to show is that there is no code anywhere in the system which will be
confused by an index springing into existence at whatever time you're
creating it.

One problem is that there are various backend-local data structures in
the relcache, the planner, and the executor that remember information
about indexes, and that may not respond well to having more indexes
show up unexpectedly. On the one hand, they might crash; on the other
hand, they might ignore the new index when they shouldn't. Another
problem is that the code which creates indexes might fail or misbehave
when run in an environment different from the one in which it
currently runs. I haven't really studied your code, so I don't know
exactly what it does, but for example it would be really bad to try to
build an index while holding a buffer lock, both because it might
cause (low-probability) undetected deadlocks and also because it might
block another process that wants that buffer lock in a
non-interruptible wait state for a long time.

Now, maybe you can make an argument that you only create indexes at
points in the query that are "safe." But I am skeptical, because of
this example:

rhaas=# create table foo (a int primary key, b text, c text, d text);
CREATE TABLE
rhaas=# create function blump() returns trigger as $$begin create
index on foo (b); return new; end$$ language plpgsql;
CREATE FUNCTION
rhaas=# create trigger thud before insert on foo execute function blump();
CREATE TRIGGER
rhaas=# insert into foo (a) select generate_series(1,10);
ERROR:  cannot CREATE INDEX "foo" because it is being used by active
queries in this session
CONTEXT:  SQL statement "create index on foo (b)"
PL/pgSQL function blump() line 1 at SQL statement

That prohibition is there for some reason. Someone did not just decide
to arbitrarily prohibit it. A CREATE INDEX command run in that context
won't run afoul of many of the things that might be problems in other
places -- e.g. there won't be a buffer lock held. Yet, despite the
fact that a trigger context is safe for executing a wide variety of
user-defined code, this particular operation is not allowed here. That
is the sort of thing that should worry you.

At any rate, even if this somehow were or could be made safe,
on-the-fly index creation is a feature that cannot and should not be
combined with a patch to implement global temporary tables. Surely, it
will require a lot of study and work to get the details right. And so
will GTT. As I said in the other email I wrote, this feature is hard
enough without adding this kind of thing to it. There's a reason why I
never got around to implementing this ten years ago when I did
unlogged tables; I was intending that to be a precursor to the GTT
work. I found that it was too hard and I gave up. I'm glad to see
people trying again, but the idea that we can afford to add in extra
features, or frankly that either of the dueling patches on this thread
are close to committable, is just plain wrong.



Sorry, I really didn't consider statements containing the word "probably" as
arguments.
But I agree with you: it is the task of the developer of a new feature to
prove that the proposed approach is safe, rather than of the reviewers to
demonstrate that it is unsafe.

Can I provide such a proof now? I'm afraid not.
But please consider two arguments:

1. An index on a GTT in any case has to be initialized on demand. For a
regular table the index is initialized at the moment of its creation; for a
GTT that doesn't work.
So we should somehow detect that the accessed index is not initialized and
perform lazy initialization of the index.
The only difference from the approach proposed by Pavel (allow an index on an
empty GTT but prohibit it for a GTT already filled with data) is whether we
also need to populate the index with data or not.
I can imagine that implicit initialization of an index in a read-only query
(SELECT) can be unsafe and cause some problems. I have not encountered such
problems yet after performing many tests with GTTs, but certainly I have not
covered all possible scenarios (not sure that it is possible at all).
But I do not understand how populating the index with data can add any extra
unsafety.

So I cannot prove that building an index for a GTT on demand is safe, but it
is not more unsafe than the on-demand initialization of the index, which is
required in any case.
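
To make the scenario concrete, here is a minimal sketch. The GLOBAL
TEMPORARY syntax is the one proposed in this thread and does not exist in
core today; the table and column names are made up:

CREATE GLOBAL TEMPORARY TABLE gtt_events (id int, payload text);

-- session A fills its private copy of the table:
INSERT INTO gtt_events SELECT g, 'x' FROM generate_series(1, 100000) g;

-- session B later creates an index; the open question is whether session A's
-- already-populated private heap should then be indexed lazily on A's next
-- access, or whether the index should stay unusable in A until session reset:
CREATE INDEX gtt_events_id_idx ON gtt_events (id);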



Re: doc: vacuum full, fillfactor, and "extra space"

2020-01-29 Thread Peter Eisentraut

On 2020-01-20 06:30, Justin Pryzby wrote:

Rebased against 40d964ec997f64227bc0ff5e058dc4a5770a70a9


I'm not sure that describing parallel vacuum in the middle of the non-full
vs. full vacuum discussion is actually that good.  I think those sentences
should be moved to a separate paragraph.


About your patch, I don't think this is clearer.  The fillfactor stuff 
is valid to be mentioned, but the way it's being proposed makes it sound 
like the main purpose of VACUUM FULL is to bloat the table to make 
fillfactor room.  The "no extra space" wording made sense to me, with 
the fillfactor business perhaps worth being put into a parenthetical note.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: making the backend's json parser work in frontend code

2020-01-29 Thread Tom Lane
Robert Haas  writes:
> On Tue, Jan 28, 2020 at 5:35 PM Mark Dilger
>  wrote:
>> I merged these a bit.  See v7-0001 for details.

> I jiggered that a bit more and committed this. I couldn't see the
> point of having both the FRONTEND and non-FRONTEND code include
> pg_wchar.h.

First buildfarm report is not positive:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2020-01-29%2015%3A30%3A26

  json.obj : error LNK2019: unresolved external symbol 
makeJsonLexContextCstringLen referenced in function json_recv 
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  jsonb.obj : error LNK2001: unresolved external symbol 
makeJsonLexContextCstringLen 
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  jsonfuncs.obj : error LNK2001: unresolved external symbol 
makeJsonLexContextCstringLen 
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  json.obj : error LNK2019: unresolved external symbol json_lex referenced in 
function json_typeof 
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  json.obj : error LNK2019: unresolved external symbol IsValidJsonNumber 
referenced in function datum_to_json 
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  json.obj : error LNK2001: unresolved external symbol nullSemAction 
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  jsonfuncs.obj : error LNK2019: unresolved external symbol pg_parse_json 
referenced in function json_strip_nulls 
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  jsonfuncs.obj : error LNK2019: unresolved external symbol 
json_count_array_elements referenced in function get_array_start 
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  jsonfuncs.obj : error LNK2019: unresolved external symbol json_errdetail 
referenced in function json_ereport_error 
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
  .\Release\postgres\postgres.exe : fatal error LNK1120: 7 unresolved externals 
[c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]



regards, tom lane




Re: making the backend's json parser work in frontend code

2020-01-29 Thread Robert Haas
On Wed, Jan 29, 2020 at 10:45 AM Tom Lane  wrote:
> Robert Haas  writes:
> > On Tue, Jan 28, 2020 at 5:35 PM Mark Dilger
> >  wrote:
> >> I merged these a bit.  See v7-0001 for details.
>
> > I jiggered that a bit more and committed this. I couldn't see the
> > point of having both the FRONTEND and non-FRONTEND code include
> > pg_wchar.h.
>
> First buildfarm report is not positive:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dory&dt=2020-01-29%2015%3A30%3A26
>
>   json.obj : error LNK2019: unresolved external symbol 
> makeJsonLexContextCstringLen referenced in function json_recv 
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   jsonb.obj : error LNK2001: unresolved external symbol 
> makeJsonLexContextCstringLen 
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   jsonfuncs.obj : error LNK2001: unresolved external symbol 
> makeJsonLexContextCstringLen 
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   json.obj : error LNK2019: unresolved external symbol json_lex referenced in 
> function json_typeof 
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   json.obj : error LNK2019: unresolved external symbol IsValidJsonNumber 
> referenced in function datum_to_json 
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   json.obj : error LNK2001: unresolved external symbol nullSemAction 
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   jsonfuncs.obj : error LNK2019: unresolved external symbol pg_parse_json 
> referenced in function json_strip_nulls 
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   jsonfuncs.obj : error LNK2019: unresolved external symbol 
> json_count_array_elements referenced in function get_array_start 
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   jsonfuncs.obj : error LNK2019: unresolved external symbol json_errdetail 
> referenced in function json_ereport_error 
> [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]
>   .\Release\postgres\postgres.exe : fatal error LNK1120: 7 unresolved 
> externals [c:\pgbuildfarm\pgbuildroot\HEAD\pgsql.build\postgres.vcxproj]

Hrrm, OK. I think it must need a sprinkling of Windows-specific magic.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: making the backend's json parser work in frontend code

2020-01-29 Thread Robert Haas
On Wed, Jan 29, 2020 at 10:48 AM Robert Haas  wrote:
> Hrrm, OK. I think it must need a sprinkling of Windows-specific magic.

I see that the patch Andrew posted earlier adjusts Mkvcbuild.pm's
@pgcommonallfiles, so I pushed that fix. The other hunks there should
go into the patch to add a test_json utility, I think. Hopefully that
will fix it, but I guess we'll see.

I was under the impression that the MSVC build gets the list of files
to build by parsing the Makefiles, but I guess that's not true at
least in the case of src/common.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Option to dump foreign data in pg_dump

2020-01-29 Thread Peter Eisentraut

On 2020-01-21 10:36, Luis Carril wrote:

Yes we can support --include-foreign-data without parallel option and
later add support for parallel option as a different patch.


Hi,

     I've attached a new version of the patch in which an error is 
emitted if the parallel backup is used with the --include-foreign-data 
option.


This seems like an overreaction.  The whole point of 
lockTableForWorker() is to avoid deadlocks, but foreign tables don't 
have locks, so it's not a problem.  I think you can just skip foreign 
tables in lockTableForWorker() using the same logic that getTables() uses.


I think parallel data dump would be an especially interesting option 
when using foreign tables, so it's worth figuring this out.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [Proposal] Global temporary tables

2020-01-29 Thread Pavel Stehule
> 2. Actually I do not propose some completely new approach. I try to
> provide behavior with is compatible with regular tables.
> If you create index for regular table, then it can be used in all
> sessions, right?
>

I don't understand this point. Regular tables share data and share files;
you cannot separate that. Moreover, you have to use relatively aggressive
locks to make this operation safe.

None of these points apply to a GTT.

Regards

Pavel


> And all "various backend-local data structures in the relcache, the
> planner, and the executor that remember information about indexes"
> have to be properly updated.  It is done using invalidation mechanism.
> The same mechanism is used in case of DDL operations with GTT, because
> we change system catalog.
>
> So my point here is that creation index of GTT is almost the same as
> creation of index for regular tables and the same mechanism will be used
> to provide correctness of this operation.
>
>
> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: [Proposal] Global temporary tables

2020-01-29 Thread Konstantin Knizhnik



On 29.01.2020 20:08, Pavel Stehule wrote:




2. Actually I do not propose some completely new approach. I try to
provide behavior with is compatible with regular tables.
If you create index for regular table, then it can be used in all
sessions, right?


I don't understand this point. Regular tables share data and share files;
you cannot separate that. Moreover, you have to use relatively aggressive
locks to make this operation safe.


None of these points apply to a GTT.


GTTs share metadata.
Since they do not share data, GTTs are safer than regular tables, aren't
they?
"Safer" means that we need less "aggressive" locks for them: we need to
protect only the metadata, not the data itself.


My point is that if we allow other sessions to access newly created indexes
on regular tables, then it will be no more complex to support that for GTTs.
Actually, "no more complex" in this case means "no extra effort is needed".


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [Proposal] Global temporary tables

2020-01-29 Thread Pavel Stehule
On Wed, 29 Jan 2020 at 18:21, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
>
> On 29.01.2020 20:08, Pavel Stehule wrote:
>
>
>
>
> 2. Actually I do not propose some completely new approach. I try to
>> provide behavior with is compatible with regular tables.
>> If you create index for regular table, then it can be used in all
>> sessions, right?
>>
>
> I don't understand this point. Regular tables share data and share files;
> you cannot separate that. Moreover, you have to use relatively aggressive
> locks to make this operation safe.
>
> None of these points apply to a GTT.
>
>
> GTTs share metadata.
> Since they do not share data, GTTs are safer than regular tables, aren't
> they?
> "Safer" means that we need less "aggressive" locks for them: we need to
> protect only the metadata, not the data itself.
>
> My point is that if we allow other sessions to access newly created indexes
> on regular tables, then it will be no more complex to support that for GTTs.
> Actually, "no more complex" in this case means "no extra effort is needed".
>

It is hard to say. I see a significant difference. When I create an index on
a regular table, I don't change the context of other processes: I have to
wait for a lock, and after I get the lock the other processes wait.

With a GTT, I don't want to wait for others - and the other processes would
have to build the index on their side, outside the expected sequence of
operations. Maybe that can have a positive effect, but it can have a negative
effect too. In this case I prefer (at this moment) zero effect on other
sessions. So I would build the index in my session, I would not want to wait
for other sessions, and if possible the other sessions should not need to
interact with or react to my action either. They should be as independent as
possible. The simplest solution is to require exclusive usage; I understand
that this may not be very practical. Better is to allow the GTT to be used by
other sessions, but with the changes invisible in those sessions until
session reset. It is a minimalistic strategy: it brings no benefit to other
sessions, but it has no negative impact on them either.



> --
> Konstantin Knizhnik
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>
>


Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-29 Thread Hamid Akhtar
Having seen the feedback on this thread, and since I tend to agree with most
of what has been said here, I also agree that the server core isn't really
the ideal place to handle orphaned prepared transactions.

Ideally, these should be handled by a transaction manager. However, I do
believe that we cannot let the database suffer for the failure of an external
piece of software, and we made a similar change when we introduced the
idle-in-transaction timeout behavior. That said, implementing something
similar for this case would be too much of an overhead, both in terms of code
complexity and resource utilisation.

I'm currently working on other options to tackle this problem.
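
(For context, this is the kind of check an external monitor can already run
today - just a sketch, not taken from any particular tool, and the one-day
threshold is an arbitrary example:)

SELECT gid, prepared, owner, database
  FROM pg_prepared_xacts
 WHERE prepared < now() - interval '1 day';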


On Tue, 28 Jan 2020 at 9:04 AM, Craig Ringer  wrote:

> On Thu, 23 Jan 2020 at 15:04, Michael Paquier  wrote:
>
> > It seems to me that what you are describing here is a set of
> > properties good for a monitoring tool that we don't necessarily need
> > to maintain in core.  There are already tools able to do that in ways
> > I think are better than what we could ever design, like
> > check_pgactivity and such.
>
> I really have to disagree here.
>
> Relying on external tools gives users who already have to piece
> together a lot of fragments even more moving parts to keep track of.
> It introduces more places where new server releases may not be
> supported in a timely manner by various tools users rely on. More
> places where users may get wrong or incomplete information from
> outdated or incorrect tools. I cite the monstrosity that
> "check_postgres.pl" has become as a specific example of why pushing
> our complexity onto external tools is not always the right answer.
>
> We already have a number of views that prettify information to help
> administrators operate the server. You could argue that
> pg_stat_activity and pg_stat_replication are unnecessary for example;
> users should use external tools to query pg_stat_get_activity(),
> pg_stat_get_wal_senders(), pg_authid and pg_database directly to get
> the information they need. Similarly, we could do away with
> pg_stat_user_indexes and the like, as they're just convenience views
> over lower level information exposed by the server.
>
> But can you really imagine using postgres day to day without
> pg_stat_activity?
>
> It is my firm opinion that visibility into locking behaviour and lock
> waits is of a similar level of importance. So is giving users some way
> to get insight into table and index bloat on our MVCC database. With
> the enormous uptake of various forms of replication and HA it's also
> important that users also be able to see what's affecting resource
> retention - holding down vacuum, retaining WAL, etc.
>
> The server knows more than any tools. Views in the server can also be
> maintained along with the server to address changes in how it manages
> things like resource retention, so external tools get a more
> consistent insight into server behaviour.
>
> > I'd rather just focus in the core code on the basics with views
> > that map directly to what we have in memory and/or disk.
>
> Per above, I just can't agree with this. PostgreSQL is a system with
> end users who need to interact with it, most of whom will not know how
> its innards work. If we're going to position it even more as a
> component in some larger stack such that it's not expected to really
> be used standalone, then we should make some effort to guide users
> toward the other components they will need *in our own documentation*
> and ensure they're tested and maintained.
>
> Proposals to do that with HA and failover tooling, backup tooling etc
> have never got off the ground. I think we do users a great disservice
> there personally. I don't expect any proposal to bless specific
> monitoring tools to be any more successful.
>
> More importantly, I fail to see why every monitoring tool should
> reinvent the same information collection queries and views, each with
> their own unique bugs and quirks, when we can provide information
> users need directly from the server.
>
> In any case I guess it's all hot air unless I pony up a patch to show
> how I think it should work.
>
> --
>  Craig Ringer   http://www.2ndQuadrant.com/
>  2ndQuadrant - PostgreSQL Solutions for the Enterprise
>
>
> --
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950  EMAIL: mailto:hamid.akh...@highgo.ca
SKYPE: engineeredvirus


Re: Enabling B-Tree deduplication by default

2020-01-29 Thread Peter Geoghegan
On Wed, Jan 29, 2020 at 6:56 AM Robert Haas  wrote:
> This (and the rest of the explanation) don't really address my
> concern. I understand that deduplicating in lieu of splitting a page
> in a unique index is highly likely to be a win. What I don't
> understand is why it shouldn't just be a win, period. Not splitting a
> page seems like it has a big upside regardless of whether the index is
> unique -- and in fact, the upside could be a lot bigger for a
> non-unique index. If the coarse-grained LP_DEAD thing is the problem,
> then I can grasp that issue, but you don't seem very worried about
> that.

You're right that I'm not worried about the coarse-grained LP_DEAD
thing here. What I'm concerned about is cases where we attempt
deduplication, but it doesn't work out because there are no duplicates
-- that means we waste some cycles. Or cases where we manage to delay
a split, but only for a very short period of time -- in theory it
would be preferable to just accept the page split up front. However,
in practice we can't make these distinctions, since it would hinge
upon predicting the future, and we don't have a good heuristic. The
fact that a deduplication pass barely manages to prevent an immediate
page split isn't a useful proxy for how likely it is that the page
will split in any timeframe. We might have prevented it from happening
for another 2 milliseconds, or we might have prevented it forever.
It's totally workload dependent.

The good news is that these extra cycles aren't very noticeable even
with a workload where deduplication doesn't help at all (e.g. with
several indexes on an append-only table, and few or no duplicates). The
cycles are generally a fixed cost. Furthermore, it seems to be
possible to virtually avoid the problem in the case of unique indexes
by applying the incoming-item-is-duplicate heuristic. Maybe I am
worrying over nothing.

> Generally, I think it's a bad idea to give the user an "emergency off
> switch" and then sometimes ignore it. If the feature seems to be
> generally beneficial, but you're worried that there might be
> regressions in obscure cases, then turn it on by default, and give the
> user the ability to forcibly turn it off. But don't give the the
> opportunity to forcibly turn it off sometimes. Nobody's going to run
> around setting a reloption just for fun -- they're going to do it
> because they hit a problem.

Actually, we do. There is both a reloption and a GUC. The GUC only
works with non-unique indexes, where the extra cost I describe might
be an issue (it can at least be demonstrated in a benchmark). The
reloption works with both unique and non-unique indexes. It will be
useful for turning off deduplication selectively in non-unique
indexes. In the case of unique indexes, it can be thought of as a
debugging thing (though we really don't want users to think about
deduplication in unique indexes at all).

I'm really having a hard time imagining or demonstrating any downside
with unique indexes, given the heuristic, so ISTM that turning off
deduplication really is just a debugging thing there.
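
For concreteness, the per-index switch would be used roughly like this. This
is only a sketch: the reloption name shown here is an assumption, and the
patch may spell it differently:

-- disable deduplication for one index at creation time:
CREATE INDEX orders_customer_idx ON orders (customer_id)
  WITH (deduplicate_items = off);

-- or toggle it later, e.g. while debugging a suspected regression:
ALTER INDEX orders_customer_idx SET (deduplicate_items = on);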

My general assumption is that 99%+ of users will want to use
deduplication everywhere. I am concerned about the remaining ~1% of
users who might have a workload that is slightly regressed by
deduplication. Even this small minority of users will still want to
use deduplication in unique indexes. Plus we don't really want to talk
about deduplication in unique indexes to users, since it'll probably
confuse them. That's another reason to treat each case differently.

Again, maybe I'm making an excessively thin distinction. I really want
to be able to enable the feature everywhere, while also not getting
even one complaint about it. Perhaps that's just not a realistic or
useful goal.

-- 
Peter Geoghegan




Re: [Proposal] Global temporary tables

2020-01-29 Thread Robert Haas
On Wed, Jan 29, 2020 at 10:30 AM Konstantin Knizhnik
 wrote:
> But please consider two arguments:
>
> 1. An index on a GTT in any case has to be initialized on demand. For a
> regular table the index is initialized at the moment of its creation; for a
> GTT that doesn't work.
> So we should somehow detect that the accessed index is not initialized and
> perform lazy initialization of the index.
> The only difference from the approach proposed by Pavel (allow an index on
> an empty GTT but prohibit it for a GTT already filled with data) is whether
> we also need to populate the index with data or not.
> I can imagine that implicit initialization of an index in a read-only query
> (SELECT) can be unsafe and cause some problems. I have not encountered such
> problems yet after performing many tests with GTTs, but certainly I have not
> covered all possible scenarios (not sure that it is possible at all).
> But I do not understand how populating the index with data can add any
> extra unsafety.
>
> So I cannot prove that building an index for a GTT on demand is safe, but
> it is not more unsafe than the on-demand initialization of the index, which
> is required in any case.

I think that the idea of calling ambuild() on the fly is not going to
work, because, again, I don't think that calling that from random
places in the code is safe. What I expect we're going to need to do
here is model this on the approach used for unlogged tables. For an
unlogged table, each table and index has an init fork which contains
the correct initial contents for that relation - which is nothing at
all for a heap table, and a couple of boilerplate pages for an index.
In the case of an unlogged table, the init forks get copied over the
main forks after crash recovery, and then we have a brand-new, empty
relation with brand-new empty indexes which everyone can use. In the
case of global temporary tables, I think that we should do the same
kind of copying, but at the time when the session first tries to
access the table. There is some fuzziness in my mind about what
exactly constitutes accessing the table - it probably shouldn't be
when the relcache entry is built, because that seems too early, but
I'm not sure what is exactly right. In any event, it's easier to find
a context where copying some files on disk (that are certain not to be
changing) is safe than it is to find a context where index builds are
safe.
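
(As a reminder of the model being referenced - this much exists today, and
the idea is to perform the analogous init-fork copy for a GTT the first time
a session touches it, rather than calling ambuild() on the fly:)

-- crash recovery copies each unlogged relation's init fork over its main
-- fork, leaving an empty table whose indexes are empty but structurally
-- valid and immediately usable by everyone:
CREATE UNLOGGED TABLE events (id int PRIMARY KEY, payload text);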

> 2. Actually I do not propose some completely new approach. I try to
> provide behavior with is compatible with regular tables.
> If you create index for regular table, then it can be used in all
> sessions, right?

Yes. :-)

> And all "various backend-local data structures in the relcache, the
> planner, and the executor that remember information about indexes"
> have to be properly updated.  It is done using invalidation mechanism.
> The same mechanism is used in case of DDL operations with GTT, because
> we change system catalog.

I mean, that's not really a valid argument. Invalidations can only
take effect at certain points in the code, and the whole argument here
is about which places in the code are safe for which operations, so
the fact that some things (like accepting invalidations) are safe at
some points in the code (like the places where we accept them) does
not prove that other things (like calling ambuild) are safe at other
points in the code (like wherever you are proposing to call it). In
particular, if you've got a relation open, there's currently no way
for another index to show up while you've still got that relation
open. That means that the planner and executor (which keep the
relevant relations open) don't ever have to worry about updating their
data structures, because it can never be necessary. It also means that
any code anywhere in the system that keeps a lock on a relation can
count on the list of indexes for that relation staying the same until
it releases the lock. In fact, it can hold on to pointers to data
allocated by the relcache and count on those pointers being stable for
as long as it holds the lock, and RelationClearRelation contain
specific code that aims to make sure that certain objects don't get
deallocated and reallocated at a different address precisely for that
reason. That code, however, only works as long as nothing actually
changes. The upshot is that it's entirely possible for changing
catalog entries in one backend with an inadequate lock level -- or at
unexpected point in the code -- to cause a core dump either in that
backend or in some other backend. This stuff is really subtle, and
super-easy to screw up.

I am speaking a bit generally here, because I haven't really studied
*exactly* what might go wrong in the relcache, or elsewhere, as a
result of creating an index on the fly. However, I'm very sure that a
general appeal to invalidation messages is not sufficient to make
something like what you want to do safe. Invalidation messages are a
complex, ancient, under-documented, fragile system for solving a very
spe

Re: Removing pg_pltemplate and creating "trustable" extensions

2020-01-29 Thread Tom Lane
I wrote:
> Stephen Frost  writes:
>> On Tue, Jan 28, 2020 at 16:17 Tom Lane  wrote:
>>> On the other hand, there's the point that lots of people have probably
>>> given out schema-CREATE privilege to users whom they wouldn't necessarily
>>> wish to trust with INSTALL privilege.  Schema-CREATE is a pretty harmless
>>> privilege, INSTALL much less so.

>> CREATE doesn't just control the ability to create schemas these days- it
>> was extended to cover publications also not that long ago.

> Oh really ... hm, that does make it a much bigger deal than I was
> thinking.  Given that, I don't think there's any huge objection to
> attaching this to CREATE, at least till we get around to a more
> significant redesign.

Here's a v5 that drops the new predefined role and allows
trusted-extension installation when you have CREATE on the current
database.  There's no other changes except a bit of documentation
wordsmithing.
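
As an illustration of the resulting behavior, here is a sketch; "foo",
"alice", and "mydb" are made-up names, and it assumes foo's control file
sets trusted = true (and superuser = true):

-- as a superuser: let alice create objects in the database
GRANT CREATE ON DATABASE mydb TO alice;

-- connected as alice, who is not a superuser:
CREATE EXTENSION foo;  -- allowed: alice has CREATE on mydb and the extension
                       -- is trusted, so its script runs as the bootstrap
                       -- superuser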

Barring further complaints, I'm going to push this fairly soon.

regards, tom lane

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 85ac79f..a10b665 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -8519,7 +8519,15 @@ SCRAM-SHA-256$:&l
  
   superuser
   bool
-  True if only superusers are allowed to install this extension
+  True if only superusers are allowed to install this extension
+   (but see trusted)
+ 
+
+ 
+  trusted
+  bool
+  True if the extension can be installed by non-superusers
+   with appropriate privileges
  
 
  
diff --git a/doc/src/sgml/ddl.sgml b/doc/src/sgml/ddl.sgml
index 3546e39..8d3a0d1 100644
--- a/doc/src/sgml/ddl.sgml
+++ b/doc/src/sgml/ddl.sgml
@@ -1742,6 +1742,7 @@ REVOKE ALL ON accounts FROM PUBLIC;
  
   
For databases, allows new schemas and publications to be created within
+   the database, and allows trusted extensions to be installed within
the database.
   
   
@@ -1753,8 +1754,11 @@ REVOKE ALL ON accounts FROM PUBLIC;
   
For tablespaces, allows tables, indexes, and temporary files to be
created within the tablespace, and allows databases to be created that
-   have the tablespace as their default tablespace.  (Note that revoking
-   this privilege will not alter the placement of existing objects.)
+   have the tablespace as their default tablespace.
+  
+  
+   Note that revoking this privilege will not alter the existence or
+   location of existing objects.
   
  
 
diff --git a/doc/src/sgml/extend.sgml b/doc/src/sgml/extend.sgml
index a3046f2..ffe068b 100644
--- a/doc/src/sgml/extend.sgml
+++ b/doc/src/sgml/extend.sgml
@@ -576,6 +576,31 @@
 version.  If it is set to false, just the privileges
 required to execute the commands in the installation or update script
 are required.
+This should normally be set to true if any of the
+script commands require superuser privileges.  (Such commands would
+fail anyway, but it's more user-friendly to give the error up front.)
+   
+  
+ 
+
+ 
+  trusted (boolean)
+  
+   
+This parameter, if set to true (which is not the
+default), allows some non-superusers to install an extension that
+has superuser set to true.
+Specifically, installation will be permitted for anyone who has
+CREATE privilege on the current database.
+When the user executing CREATE EXTENSION is not
+a superuser but is allowed to install by virtue of this parameter,
+then the installation or update script is run as the bootstrap
+superuser, not as the calling user.
+This parameter is irrelevant if superuser is
+false.
+Generally, this should not be set true for extensions that could
+allow access to otherwise-superuser-only abilities, such as
+filesystem access.

   
  
@@ -642,6 +667,18 @@
 
 
 
+ If the extension script contains the
+ string @extowner@, that string is replaced with the
+ (suitably quoted) name of the user calling CREATE
+ EXTENSION or ALTER EXTENSION.  Typically
+ this feature is used by extensions that are marked trusted to assign
+ ownership of selected objects to the calling user rather than the
+ bootstrap superuser.  (One should be careful about doing so, however.
+ For example, assigning ownership of a C-language function to a
+ non-superuser would create a privilege escalation path for that user.)
+
+
+
  While the script files can contain any characters allowed by the specified
  encoding, control files should contain only plain ASCII, because there
  is no way for PostgreSQL to know what encoding a
diff --git a/doc/src/sgml/ref/create_extension.sgml b/doc/src/sgml/ref/create_extension.sgml
index 36837f9..d76ac3e 10

Re: Enabling B-Tree deduplication by default

2020-01-29 Thread Robert Haas
On Wed, Jan 29, 2020 at 1:15 PM Peter Geoghegan  wrote:
> The good news is that these extra cycles aren't very noticeable even
> with a workload where deduplication doesn't help at all (e.g. with
> several indexes an append-only table, and few or no duplicates). The
> cycles are generally a fixed cost. Furthermore, it seems to be
> possible to virtually avoid the problem in the case of unique indexes
> by applying the incoming-item-is-duplicate heuristic. Maybe I am
> worrying over nothing.

Yeah, maybe. I'm tempted to advocate for dropping the GUC and keeping
the reloption. If the worst case is a 3% regression and you expect
that to be rare, I don't think a GUC is really worth it, especially
given that the proposed semantics seem somewhat confusing. The
reloption can be used in a pinch to protect against either bugs or
performance regressions, whichever may occur, and it doesn't seem like
you need a second mechanism.

> Again, maybe I'm making an excessively thin distinction. I really want
> to be able to enable the feature everywhere, while also not getting
> even one complaint about it. Perhaps that's just not a realistic or
> useful goal.

One thing that you could do is try to learn whether deduplication (I
really don't like that name, but here we are) seems to be working for
a given index, perhaps even in a given session. For instance, suppose
you keep track of what happened the last ten times the current session
attempted deduplication within a given index. Store the state in the
relcache. If all of the last ten tries were failures, then only try
1/4 of the time thereafter. If you have a success, go back to trying
every time. That's pretty crude, but it might be good enough to
blunt the downsides to the point where you can stop worrying.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] kqueue

2020-01-29 Thread Mark Wong
On Sat, Jan 25, 2020 at 11:29:11AM +1300, Thomas Munro wrote:
> On Thu, Jan 23, 2020 at 9:38 AM Rui DeSousa  wrote:
> > On Jan 22, 2020, at 2:19 PM, Tom Lane  wrote:
> >> It's certainly possible that to see any benefit you need stress
> >> levels above what I can manage on the small box I've got these
> >> OSes on.  Still, it'd be nice if a performance patch could show
> >> some improved performance, before we take any portability risks
> >> for it.
> 
> You might need more than one CPU socket, or at least lots more cores
> so that you can create enough contention.  That was needed to see the
> regression caused by commit ac1d794 on Linux[1].
> 
> > Here are two charts comparing a patched and an unpatched system.
> > These systems are very large and have just shy of a thousand
> > connections each, with an average of 20 to 30 active queries running
> > concurrently at times, including hundreds if not thousands of queries
> > hitting the database in rapid succession.  The effect is that the
> > unpatched system generates a lot of system load just handling idle
> > connections, whereas the patched version is not impacted by idle
> > sessions or sessions that have already received data.
> 
> Thanks.  I can reproduce something like this on an Azure 72-vCPU
> system, using pgbench -S -c800 -j32.  The point of those settings is
> to have many backends, but they're all alternating between work and
> sleep.  That creates a stream of poll() syscalls, and system time goes
> through the roof (all CPUs pegged, but it's ~half system).  Profiling
> the kernel with dtrace, I see the most common stack (by a long way) is
> in a poll-related lock, similar to a profile Rui sent me off-list from
> his production system.  Patched, there is very little system time and
> the TPS number goes from 539k to 781k.
> 
> [1] 
> https://www.postgresql.org/message-id/flat/CAB-SwXZh44_2ybvS5Z67p_CDz%3DXFn4hNAD%3DCnMEF%2BQqkXwFrGg%40mail.gmail.com

Just to add some data...

I tried the kqueue v14 patch on an AWS EC2 m5a.24xlarge (96 vCPU) with
FreeBSD 12.1, driving from a m5.8xlarge (32 vCPU) CentOS 7 system.

I also use pgbench with a scale factor of 1000, with -S -c800 -j32.

Comparing pg 12.1 vs 13-devel (30012a04):

* TPS increased from ~93,000 to ~140,000, ~ 32% increase
* system time dropped from ~ 78% to ~ 70%, ~ 8% decrease
* user time increased from ~16% to ~ 23%, ~7% increase

I don't have any profile data, but I've attached a couple of charts showing
the processor utilization over a 15 minute interval from the database
system.

Regards,
Mark
-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/


Marking some contrib modules as trusted extensions

2020-01-29 Thread Tom Lane
Now that we're just about there on the patch to invent trusted
extensions [1], I'd like to start a discussion about whether to mark
anything besides the trusted PLs as trusted.  I think generally
we ought to mark contrib modules as trusted if it's sane to do so;
there's not much point in handing people plperl (even sandboxed)
but not, say, hstore.  I trawled through what's in contrib today
and broke things down like this:

Certainly NO, as these allow external or low-level access:

adminpack
dblink
file_fdw
postgres_fdw
pageinspect
pg_buffercache
pg_freespacemap
pg_visibility
pgstattuple

Probably NO, if only because you'd need additional privileges
to use these anyway:

amcheck
dict_xsyn
hstore_plperlu
hstore_plpython2u
hstore_plpython3u
hstore_plpythonu
jsonb_plperlu
jsonb_plpython2u
jsonb_plpython3u
jsonb_plpythonu
ltree_plpython2u
ltree_plpython3u
ltree_plpythonu
pg_prewarm
pg_stat_statements

Definitely candidates to mark trusted:

citext
cube
dict_int        (unlike dict_xsyn, this needs no external file)
earthdistance   (marginal usefulness though)
fuzzystrmatch
hstore
hstore_plperl
intagg  (marginal usefulness though)
intarray
isn
jsonb_plperl
lo
ltree
pg_trgm
pgcrypto
seg
tablefunc
tcn
tsm_system_rows
tsm_system_time
unaccent        (needs external file, but the default one is useful)
uuid-ossp

Not sure what I think about these:

bloom   (are these useful in production?)
btree_gin
btree_gist
pgrowlocks  (seems safe, but are there security issues?)
spi/autoinc (I doubt that these four are production grade)
spi/insert_username
spi/moddatetime
spi/refint
sslinfo (seems safe, but are there security issues?)
xml2            (nominally safe, but deprecated, and libxml2
                 has been a fertile source of security issues)

Any opinions about these, particularly the on-the-edge cases?

Also, how should we document this, if we do it?  Add a boilerplate
sentence to each module's description about whether it is trusted
or not?  Put a table up at the front of Appendix F?  Both?

I'm happy to go make this happen, once we have consensus on what
should happen.

regards, tom lane

[1] https://www.postgresql.org/message-id/flat/5889.1566415762%40sss.pgh.pa.us




Re: Enabling B-Tree deduplication by default

2020-01-29 Thread Peter Geoghegan
On Wed, Jan 29, 2020 at 10:41 AM Robert Haas  wrote:
> Yeah, maybe. I'm tempted to advocate for dropping the GUC and keeping
> the reloption. If the worst case is a 3% regression and you expect
> that to be rare, I don't think a GUC is really worth it, especially
> given that the proposed semantics seem somewhat confusing. The
> reloption can be used in a pinch to protect against either bugs or
> performance regressions, whichever may occur, and it doesn't seem like
> you need a second mechanism.

I like the idea of getting rid of the GUC.

I should stop talking about it for now, and go back to reassessing the
extent of the regression in highly unsympathetic cases. The patch has
become faster in a couple of different ways since I last looked at
this question, and it's entirely possible that the regression is even
smaller than it was before.

> One thing that you could do is try to learn whether deduplication (I
> really don't like that name, but here we are) seems to be working for
> a given index, perhaps even in a given session. For instance, suppose
> you keep track of what happened the last ten times the current session
> attempted deduplication within a given index. Store the state in the
> relcache.

It's tempting to try to reason about the state of an index over time
like this, but I don't think that it's ever going to work well.
Imagine a unique index where 50% of all values are NULLs, on an
append-only table. Actually, let's say it's a non-unique index with
unique integers, and NULL values for the remaining 50% of rows -- that
way we don't get the benefit of the incoming-item-is-duplicate
heuristic.

There will be two contradictory tendencies within this particular
index. We might end up ping-ponging between each behavior. It seems
better to just accept a small performance hit.

Sometimes we can add heuristics to things like the split point
location choice logic (nbtsplitloc.c) that behave as if we were
keeping track of how things progress across successive, related page
splits in the same index -- though we don't actually keep track of
anything. I'm thinking of things like Postgres v12's "split after new
tuple" optimization, which makes the TPC-C indexes so much smaller. We
can do these things statelessly and safely only because the heuristics
have little to lose and much to gain. To a lesser extent, it's okay
because the heuristics are self-limiting -- we can only make an
incorrect inference about what to do because we were unlucky, but
there is no reason to think that we'll consistently make the wrong
choice. It feels a little bit like quicksort to me.

-- 
Peter Geoghegan




parens cleanup

2020-01-29 Thread Alvaro Herrera
Some ereport calls have excess sets of parentheses.  patch 0001 removes
the ones I found in a very quick grep.

0002 removes newlines immediately following parens.  These were
previously useful because pgindent would move arguments further to the
left so that the line would fit under 80 chars.  However, pgindent no
longer does that, so the newlines are pointless and ugly.

These being cosmetic cleanups, they're not intended for backpatch,
though an argument could be made that doing that would save some future
backpatching pain.  If there are sufficient votes for that, I'm open to
doing it.  (Of course, 0002 would not be backpatched further back than
pg10, the first release that uses the "new" pgindent rules.)

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From dc31967fd68053e3c03ac70f5be231c3d0720be6 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 29 Jan 2020 15:45:47 -0300
Subject: [PATCH 1/2] Remove excess parens in ereport() calls

Cosmetic cleanup, not worth backpatching.
---
 contrib/adminpack/adminpack.c  | 12 ++--
 contrib/hstore_plperl/hstore_plperl.c  |  2 +-
 contrib/jsonb_plperl/jsonb_plperl.c|  6 +++---
 contrib/jsonb_plpython/jsonb_plpython.c|  8 
 contrib/pageinspect/brinfuncs.c|  8 
 contrib/pageinspect/btreefuncs.c   |  8 
 contrib/pageinspect/fsmfuncs.c |  2 +-
 contrib/pageinspect/ginfuncs.c |  6 +++---
 contrib/pageinspect/hashfuncs.c| 10 +-
 contrib/pageinspect/heapfuncs.c|  2 +-
 contrib/pageinspect/rawpage.c  |  6 +++---
 contrib/pg_prewarm/pg_prewarm.c|  4 ++--
 contrib/pgstattuple/pgstatapprox.c |  2 +-
 contrib/pgstattuple/pgstatindex.c  | 10 +-
 contrib/pgstattuple/pgstattuple.c  |  4 ++--
 src/backend/access/transam/xlogfuncs.c |  4 ++--
 src/backend/commands/alter.c   |  8 
 src/backend/commands/collationcmds.c   |  2 +-
 src/backend/commands/createas.c|  2 +-
 src/backend/commands/functioncmds.c|  4 ++--
 src/backend/commands/publicationcmds.c |  2 +-
 src/backend/commands/subscriptioncmds.c|  2 +-
 src/backend/commands/user.c|  4 ++--
 src/backend/libpq/auth-scram.c |  8 
 src/backend/libpq/be-secure-openssl.c  |  6 +++---
 src/backend/postmaster/syslogger.c |  4 ++--
 src/backend/replication/logical/logical.c  |  6 +++---
 src/backend/replication/logical/logicalfuncs.c |  2 +-
 src/backend/replication/repl_gram.y|  4 ++--
 src/backend/replication/slot.c |  2 +-
 src/backend/replication/slotfuncs.c|  4 ++--
 src/backend/replication/walsender.c|  6 +++---
 src/backend/storage/ipc/procarray.c|  4 ++--
 src/backend/storage/ipc/signalfuncs.c  | 14 +++---
 src/backend/utils/adt/genfile.c| 12 ++--
 src/backend/utils/adt/pg_upgrade_support.c |  2 +-
 src/backend/utils/misc/guc.c   |  2 +-
 37 files changed, 97 insertions(+), 97 deletions(-)

diff --git a/contrib/adminpack/adminpack.c b/contrib/adminpack/adminpack.c
index 7b5a531e08..29b46aea3e 100644
--- a/contrib/adminpack/adminpack.c
+++ b/contrib/adminpack/adminpack.c
@@ -93,7 +93,7 @@ convert_and_check_filename(text *arg, bool logAllowed)
 		if (path_contains_parent_reference(filename))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 (errmsg("reference to parent directory (\"..\") not allowed";
+	 errmsg("reference to parent directory (\"..\") not allowed")));
 
 		/*
 		 * Allow absolute paths if within DataDir or Log_directory, even
@@ -104,12 +104,12 @@ convert_and_check_filename(text *arg, bool logAllowed)
 			 !path_is_prefix_of_path(Log_directory, filename)))
 			ereport(ERROR,
 	(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-	 (errmsg("absolute path not allowed";
+	 errmsg("absolute path not allowed")));
 	}
 	else if (!path_is_relative_and_below_cwd(filename))
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- (errmsg("path must be in or below the current directory";
+ errmsg("path must be in or below the current directory")));
 
 	return filename;
 }
@@ -124,7 +124,7 @@ requireSuperuser(void)
 	if (!superuser())
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- (errmsg("only superuser may access generic file functions";
+ errmsg("only superuser may access generic file functions")));
 }
 
 
@@ -485,7 +485,7 @@ pg_logdir_ls(PG_FUNCTION_ARGS)
 	if (!superuser())
 		ereport(ERROR,
 (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
- (errmsg("only superuser can list the log directory";
+ errmsg("only sup

Re: Marking some contrib modules as trusted extensions

2020-01-29 Thread Alvaro Herrera
On 2020-Jan-29, Tom Lane wrote:

> Not sure what I think about these:
> 
> bloom (are these useful in production?)
> btree_gin
> btree_gist
> pgrowlocks  (seems safe, but are there security issues?)
> spi/autoinc   (I doubt that these four are production grade)
> spi/insert_username
> spi/moddatetime
> spi/refint
> sslinfo   (seems safe, but are there security issues?)
> xml2  (nominally safe, but deprecated, and libxml2
>has been a fertile source of security issues)

Of these, btree_gist is definitely useful from a user perspective,
because it enables creation of certain exclusion constraints.
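
For example, this canonical pattern (table name made up) fails without
btree_gist, because the scalar equality part of the constraint needs the
GiST opclasses that the module provides:

CREATE EXTENSION btree_gist;

CREATE TABLE room_booking (
    room   int,
    during tsrange,
    EXCLUDE USING gist (room WITH =, during WITH &&)
);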

I've never heard of anyone using bloom indexes in production.  I'd
argue that if the feature is useful, then we should turn it into a
core-included index AM with regular WAL logging for improved
performance, and add a stripped-down version to src/test/modules to
cover the WAL-log testing needs.  Maybe exposing it more, as promoting
it as a trusted extension would do, would help find more use cases for
it.
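
(For reference, using it looks like this - a minimal sketch with made-up
table and column names:)

CREATE EXTENSION bloom;
CREATE TABLE t (i1 int, i2 int, i3 int, i4 int);
CREATE INDEX t_bloom_idx ON t USING bloom (i1, i2, i3, i4);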

> Also, how should we document this, if we do it?  Add a boilerplate
> sentence to each module's description about whether it is trusted
> or not?  Put a table up at the front of Appendxix F?  Both?

If it were possible to do both from a single source of truth, that would
be great.  Failing that, I'd just list it in each module's section.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Default JIT setting in V12

2020-01-29 Thread Soumyadeep Chakraborty
Hello,

Based on this thread, Alexandra and I decided to investigate whether we could
borrow some passes from -O1 and add them on top of the default optimization
of -O0 plus mem2reg. To determine which passes would make the most sense, we
ran ICW with jit_above_cost set to 0, dumped all the backends, and then
analyzed the dumps with 'opt'. The dumped stats showed that the instcombine
and sroa passes had the most scope for optimization. We have attached the
stats we dumped.

Then, we investigated whether mixing in sroa and instcombine gave us a better
run time. We used TPCH Q1 (TPCH repo we used:
https://github.com/dimitri/tpch-citus) at scales of 1, 5 and 50. We found
that there was no significant difference in query runtime over the default of
-O0 with mem2reg.

We also performed the same experiment with -O1 as the default optimization
level, as Andres had suggested on this thread. We found that the results were
much more promising (see the results for scale = 5 and 50 below). At the
lower scale of 1, we had to force optimization because the query cost did not
reach the threshold. There was no adverse impact from the increased query
optimization time due to the ramp-up to -O1 at this lower scale.
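
For anyone wanting to reproduce the setup, forcing JIT per session looks
roughly like this. The GUC names are real, but the exact values and the
query shown are assumptions standing in for the TPCH runs described above:

SET max_parallel_workers_per_gather = 0;
SET jit = on;
SET jit_above_cost = 0;            -- JIT-compile regardless of plan cost
SET jit_optimize_above_cost = 0;   -- also apply the expensive optimizations
EXPLAIN (ANALYZE)                  -- JIT timings appear at the end of the plan
SELECT l_returnflag, l_linestatus, count(*)
FROM lineitem
GROUP BY 1, 2;                     -- stand-in for TPCH Q1's aggregation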


Results summary (eyeball-averaged over 5 runs, excluding the first run after
restart; for each configuration we flushed the OS cache and restarted the
database):

settings: max_parallel_workers_per_gather = 0

scale = 50:
-O3                                : 77s
-O0 + mem2reg                      : 107s
-O0 + mem2reg + instcombine        : 107s
-O0 + mem2reg + sroa               : 107s
-O0 + mem2reg + sroa + instcombine : 107s
-O1                                : 84s

scale = 5:
-O3                                : 8s
-O0 + mem2reg                      : 10s
-O0 + mem2reg + instcombine        : 10s
-O0 + mem2reg + sroa               : 10s
-O0 + mem2reg + sroa + instcombine : 10s
-O1                                : 8s

scale = 1:
-O3                                : 1.7s
-O0 + mem2reg                      : 1.7s
-O0 + mem2reg + instcombine        : 1.7s
-O0 + mem2reg + sroa               : 1.7s
-O0 + mem2reg + sroa + instcombine : 1.7s
-O1                                : 1.7s

Based on the evidence above, maybe it is worth considering ramping up the
default optimization level to -O1.

Regards,

Soumyadeep and Alexandra


opt_dump.pdf
Description: Adobe PDF document


Re: Marking some contrib modules as trusted extensions

2020-01-29 Thread Komяpa
Hello,


> btree_gin
> btree_gist


I would even ask for btree_gin and btree_gist to be moved into core.

btree_gist ships opclasses that let built-in types be used in GiST indexes.
The btree_* part of the name is confusing, as it suggests there is some magic
linking btree and GiST.

GiST is the most popular way to get geometric indexes, and these often need
to be combined with some class identifier that is used in lookups alongside
the geometry. CREATE INDEX ON geom_table USING gist (zoom_level, geom); fails
for no obvious reason without btree_gist - the types are shipped in core and
GiST itself is not an extension, but letting one core mechanism be used with
another in an obvious way is for some reason split out into an extension.


-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Collation versioning

2020-01-29 Thread Julien Rouhaud
On Tue, Jan 28, 2020 at 1:06 PM Peter Eisentraut
 wrote:
>
> On 2019-12-17 14:25, Julien Rouhaud wrote:
> > PFA rebased v6, with the following changes:
>
> Some thoughts on this patch set.

Thanks for looking at it!

> I think we are all agreed on the general idea.
>
> 0001-0003 seem pretty much OK.  Why is pg_depend.refobjversion of type
> "name" whereas the previous pg_collation.collversion was type "text"?
> Related to that, we previously used null to indicate an unknown
> collation version, and now it's an empty string.

That's what Thomas implemented when he started to work on it and I
simply kept it that way until now.  I'm assuming that it was simply to
avoid wasting time on the varlena stuff while working on the
prototype, so barring any objections I'll change to nullable text
column in the next revision.

> For 0005, if the new commands are only to be used in binary upgrades,
> then they should be implemented as built-in functions instead of full
> DDL commands.  There is precedent for that.

Oh, I wasn't aware of that.  I can definitely use built-in functions
instead, but some people previously argued that those commands should
be available even outside of binary upgrade, and I'm not clear on whether
this is wanted or not.  Do you have any thoughts on that?

> The documentation snippet for this patch talks about upgrades from PG12
> to later.  But what about upgrades on platforms where we currently don't
> have collation versioning but might introduce it later (FreeBSD,
> Windows)?  This needs to be generalized.

Good point, I'll try to improve that.

> For 0006 ("Add a new ALTER INDEX name DEPENDS ON COLLATION name
>   CURRENT VERSION"), I find the syntax misleading.  This command doesn't
> (or shouldn't) add a new dependency, it only changes the version of an
> existing dependency.  The previously used syntax ALTER COLLATION /
> REFRESH VERSION is a better vocabulary, I think.

I agree, and I also complained about that syntax.  However, I'm
struggling to come up with a syntax that makes it clear it's
refreshing the version of a collation the index already depends on.
E.g.:

ALTER INDEX name ALTER COLLATION name REFRESH VERSION

is still quite poor, but I don't have anything better.  Do you have
a better suggestion, or should I go with that?

> I think this whole thing needs more tests.  We are designing this
> delicate mechanism but have no real tests for it.  We at least need to
> come up with a framework for how to test this automatically, so that we
> can add more test cases over time as people will invariably report
> issues when this hits the real world.

Indeed.  I have some unlikely index test cases that I'm using for now to
validate the behavior, but I haven't started on a real test infrastructure.
I'll also work on that for the next revision, although I'll need to think
some more about how to properly test the pg_upgrade part.




Re: Enabling B-Tree deduplication by default

2020-01-29 Thread Robert Haas
On Wed, Jan 29, 2020 at 2:50 PM Peter Geoghegan  wrote:
> It's tempting to try to reason about the state of an index over time
> like this, but I don't think that it's ever going to work well.
> Imagine a unique index where 50% of all values are NULLs, on an
> append-only table. Actually, let's say it's a non-unique index with
> unique integers, and NULL values for the remaining 50% of rows -- that
> way we don't get the benefit of the incoming-item-is-duplicate
> heuristic.

I mean, if you guess wrong and deduplicate less frequently, you are no
worse off than today.

But it depends, too, on the magnitude.  If a gain is both large and
probable and a loss is both small and improbable, then accepting a
bit of slowdown when it happens may be the right call.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: making the backend's json parser work in frontend code

2020-01-29 Thread Andrew Dunstan
On Wed, Jan 29, 2020 at 4:32 PM Andrew Dunstan
 wrote:
>
>
> On 1/28/20 5:28 PM, Mark Dilger wrote:
> >
> >
> >> +# There doesn't seem to be any easy way to get TestLib to use the 
> >> binaries from
> >> +# our directory, so we hack up a path to our binary and run that
> >> directly.  This
> >> +# seems brittle enough that some other solution should be found, if 
> >> possible.
> >> +
> >> +my $test_json = join('/', $ENV{TESTDIR}, 'test_json');
> >>
> >> I don't know what the right thing to do here is. Perhaps someone more
> >> familiar with TAP testing can comment.
> > Yeah, I was hoping that might get a comment from Andrew.  I think if it 
> > works as-is on windows, we could just use it this way until it causes a 
> > problem on some platform or other.  It’s not a runtime issue, being only a 
> > build-time test, and only then when tap tests are enabled *and* running 
> > check-world, so nobody should really be adversely affected.  I’ll likely 
> > get around to testing this on Windows, but I don’t have any Windows 
> > environments set up yet, as that is still on my todo list.
> >
>
>
> I think using TESTDIR is Ok,


I've changed my mind, I don't think that will work for MSVC, the
executable gets built elsewhere for that. I'll try to come up with
something portable.

cheers

andrew


-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: making the backend's json parser work in frontend code

2020-01-29 Thread Mark Dilger



> On Jan 29, 2020, at 1:02 PM, Andrew Dunstan  
> wrote:
> 
> On Wed, Jan 29, 2020 at 4:32 PM Andrew Dunstan
>  wrote:
>> 
>> 
>> On 1/28/20 5:28 PM, Mark Dilger wrote:
>>> 
>>> 
 +# There doesn't seem to be any easy way to get TestLib to use the 
 binaries from
 +# our directory, so we hack up a path to our binary and run that
 directly.  This
 +# seems brittle enough that some other solution should be found, if 
 possible.
 +
 +my $test_json = join('/', $ENV{TESTDIR}, 'test_json');
 
 I don't know what the right thing to do here is. Perhaps someone more
 familiar with TAP testing can comment.
>>> Yeah, I was hoping that might get a comment from Andrew.  I think if it 
>>> works as-is on windows, we could just use it this way until it causes a 
>>> problem on some platform or other.  It’s not a runtime issue, being only a 
>>> build-time test, and only then when tap tests are enabled *and* running 
>>> check-world, so nobody should really be adversely affected.  I’ll likely 
>>> get around to testing this on Windows, but I don’t have any Windows 
>>> environments set up yet, as that is still on my todo list.
>>> 
>> 
>> 
>> I think using TESTDIR is Ok,
> 
> 
> I've changed my mind, I don't think that will work for MSVC, the
> executable gets built elsewhere for that. I'll try to come up with
> something portable.

I’m just now working on getting my Windows VMs set up with Visual Studio and 
whatnot, per the wiki instructions, so I don’t need to burden you with this 
sort of Windows task in the future.  If there are any gotchas not mentioned on 
the wiki, I’d appreciate pointers about how to avoid them.  I’ll try to help 
devise a solution, or test what you come up with, once I’m properly set up for 
that.

For no particular reason, I chose Windows Server 2019 and Windows 10 Pro.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Marking some contrib modules as trusted extensions

2020-01-29 Thread Tom Lane
Darafei "Komяpa" Praliaskouski  writes:
>> btree_gin
>> btree_gist

> I would even ask btree_gin and btree_gist to be moved to core.

That's not in scope here.  Our past experience with trying to move
extensions into core is that it creates a pretty painful upgrade
experience for users, so that's not something I'm interested in doing
... especially for relatively marginal cases like these.

There's also a more generic question of why we should want to move
anything to core anymore.  The trusted-extension mechanism removes
one of the biggest remaining gripes about extensions, namely the
pain level for installing them.  (But please, let's not have that
debate on this thread.)

regards, tom lane




Re: Marking some contrib modules as trusted extensions

2020-01-29 Thread Julien Rouhaud
On Wed, Jan 29, 2020 at 9:46 PM Darafei "Komяpa" Praliaskouski
 wrote:
>
> Hello,
>
>>
>> btree_gin
>> btree_gist
>
>
> I would even ask btree_gin and btree_gist to be moved to core.

Without going that far, I also agree that I've relied on those extensions
quite often, so +1 for marking them as trusted.

>> Probably NO, if only because you'd need additional privileges
>> to use these anyway:
>> pg_stat_statements

But the additional privileges are global, so assuming the extension
has been properly set up, wouldn't it be sensible to ease the
per-database installation?  If it's not properly set up, there's no harm in
creating the extension anyway.
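
To illustrate what easing the per-database installation would mean in
practice (a sketch only; it assumes the trusted marking plus a
shared_preload_libraries entry already configured by the administrator):

-- connected as a non-superuser that has CREATE on the current database:
CREATE EXTENSION pg_stat_statements;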




Re: Marking some contrib modules as trusted extensions

2020-01-29 Thread Tom Lane
Julien Rouhaud  writes:
>>> Probably NO, if only because you'd need additional privileges
>>> to use these anyway:
>>> pg_stat_statements

> But the additional privileges are global, so assuming the extension
> has been properly setup, wouldn't it be sensible to ease the
> per-database installation?  If not properly setup, there's no harm in
> creating the extension anyway.

Mmm, I'm not convinced --- the ability to see what statements are being
executed in other sessions (even other databases) is something that
paranoid installations might not be so happy about.  Our previous
discussions about what privilege level is needed to look at
pg_stat_statements info were all made against a background assumption
that you needed some extra privilege to set up the view in the first
place.  I think that would need another look or two before being
comfortable that we're not shifting the goal posts too far.

The bigger picture here is that I don't want to get push-back that
we've broken somebody's security posture by marking too many extensions
trusted.  So for anything where there's any question about security
implications, we should err in the conservative direction of leaving
it untrusted.

regards, tom lane




Re: parens cleanup

2020-01-29 Thread Tom Lane
Alvaro Herrera  writes:
> Some ereport calls have excess sets of parentheses.  patch 0001 removes
> the ones I found in a very quick grep.

+1 ... kind of looks like somebody got this wrong long ago, and then
various people copied-and-pasted from a bad example.

> 0002 removes newlines immediately following parens.  These were
> previously useful because pgindent would move arguments further to the
> left so that the line would fit under 80 chars.  However, pgindent no
> longer does that, so the newlines are pointless and ugly.

+1 except for the changes in zic.c.  Those line breaks are following
the upstream code, so I'd just put them back in the next merge ...

> These being cosmetic cleanups, they're not intended for backpatch,
> though an argument could be made that doing that would save some future
> backpatching pain.  If there are sufficient votes for that, I'm open to
> doing it.  (Of course, 0002 would not be backpatched further back than
> pg10, the first release that uses the "new" pgindent rules.)

Meh, -0.1 or so on back-patching.

regards, tom lane




Re: standby apply lag on inactive servers

2020-01-29 Thread Alvaro Herrera
On 2020-Jan-28, Kyotaro Horiguchi wrote:

> The aim of the patch seems reasonable. XLOG_END_OF_RECOVERY and
> XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
> could be included for consistency.)

Hmm I think I should definitely include those.

> The timestamp of a checkpoint record is the start time of a checkpoint
> (and doesn't have a subseconds part, but that's not a problem). That
> means that the timestamp is usually far behind the time at which the record
> was inserted. As a result the stored timestamp can go back by a
> significant interval. I don't think that causes an actual problem but
> the movement looks weird as the result of
> pg_last_xact_replay_timestamp().

A problem I see with this is that setting the timestamp is done with
XLogCtl->lock held; since now we need to make the update conditional,
we'd have to read the current value, compare with the checkpoint time,
then set, all with the spinlock held.  That seems way too expensive.

A compromise might be to do the compare only when it's done for
checkpoint.  These occur seldom enough that it shouldn't be a problem
(as opposed to commit records, which can be very frequent).

> Aside from the backward movement, a timestamp from other than
> commit/abort records in recoveryLastXTime affects the following code.
> 
> xlog.c:7329: (similar code exists at line 9332)
> >ereport(LOG,
> > (errmsg("redo done at %X/%X",
> > (uint32) (ReadRecPtr >> 32), (uint32) 
> > ReadRecPtr)));
> >xtime = GetLatestXTime();
> >if (xtime)
> > ereport(LOG,
> > (errmsg("last completed transaction was at log time %s",
> > timestamptz_to_str(xtime;
> 
> This code assumes (and the name GetLatestXTime() suggests; I first
> noticed that here) that the timestamp comes from commit/abort records,
> so otherwise it shows a wrong timestamp.  We shouldn't update the
> variable from other kinds of records.

Thinking about this some more, I think we should keep the message the
same in the backbranches (avoid breaking anything that might be reading the log
-- a remote but not nonexistent possibility), and adjust the message in
master to be something like "last timestamped WAL activity at time %s",
and document that it means commit, abort, restore label, or checkpoint.
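
For illustration, the SQL-level face of this value is the following; on a
standby whose upstream is idle, the apparent lag keeps growing even though
nothing is pending, which is part of what makes the semantics confusing:

SELECT pg_last_xact_replay_timestamp() AS last_replay_timestamp,
       now() - pg_last_xact_replay_timestamp() AS apparent_lag;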

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Memory-Bounded Hash Aggregation

2020-01-29 Thread Jeff Davis
On Fri, 2020-01-24 at 17:16 -0800, Peter Geoghegan wrote:
> That sounds weird. Might be pathological in some sense.
> 
> I have a wild guess for you. Maybe this has something to do with the
> "test for presorted input" added by commit a3f0b3d68f9. That can
> perform very badly when the input is almost sorted, but has a few
> tuples that are out of order towards the end. (I have called these
> "banana skin tuples" in the past.)

My simple test case is: 'explain analyze select i from big group by
i;', where "big" has 20M tuples.

I tried without that change and it helped (brought the time from 55s to
45s). But if I completely remove the sorting of the freelist, it goes
down to 12s. So it's something about the access pattern.

After digging a bit more, I see that, for Sort, the LogicalTapeSet's
freelist hovers around 300 entries and doesn't grow larger than that.
For HashAgg, it gets up to almost 60K. The pattern in HashAgg is that
the space required is at a maximum after the first spill, and after
that point the used space declines with each batch (because the groups
that fit in the hash table were finalized and emitted, and only the
ones that didn't fit were written out). As the amount of required space
declines, the size of the freelist grows.

That leaves a few options:

1) Cap the size of the LogicalTapeSet's freelist. If the freelist is
growing large, that's probably because it will never actually be used.
I'm not quite sure how to pick the cap though, and it seems a bit hacky
to just leak the freed space.

2) Use a different structure more capable of handling a large fraction
of free space. A compressed bitmap might make sense, but that seems
like overkill to waste effort tracking a lot of space that is unlikely
to ever be used.

3) Don't bother tracking free space for HashAgg at all. There's already
an API for that so I don't need to further hack logtape.c.

4) Try to be clever and shrink the file (or at least the tracked
portion of the file) if the freed blocks are at the end. This wouldn't
be very useful in the first recursive level, but the problem is worst
for the later levels anyway. Unfortunately, I think this requires a
breadth-first strategy to make sure that blocks at the end get freed.
If I do change it to breadth-first also, this does amount to a
significant speedup.

I am leaning toward #1 or #3.

As an aside, I'm curious why the freelist is managed the way it is.
Newly-released blocks are likely to be higher in number (or at least
not the lowest in number), but they are added to the end of an array.
The array is therefore likely to require repeated re-sorting to get
back to descending order. Wouldn't a minheap or something make more
sense?

Regards,
Jeff Davis






Re: Add %x to PROMPT1 and PROMPT2

2020-01-29 Thread Vik Fearing
On 29/01/2020 08:25, Fabien COELHO wrote:
> 
> Hello Vik,
> 
>>> Isn't there examples in the documentation which use the default prompts?
>>>
>>> If so, should they be updated accordingly?
>>
>> Good catch!
>> I thought about the documentation but not the examples therein.
>>
>> Updated patch attached.
> 
> Ok.
> 
> Only one transaction prompt example in the whole documentation :-(
> No tests are troubled by the change :-(
> Sigh…
> 
> Patch applies and compiles cleanly, global and psql make check ok.
> 
> Doc build ok.
> 
> Works for me.

Thanks for the review!

Would you mind changing the status in the commitfest app?
https://commitfest.postgresql.org/27/2427/
-- 
Vik Fearing




Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-01-29 Thread Kyotaro Horiguchi
Hello.

At Wed, 29 Jan 2020 23:24:09 +0900, Fujii Masao  
wrote in 
> On 2020/01/29 20:06, Kasahara Tatsuhito wrote:
> > Hi.
> > The attached patch solves this problem.
> > This patch adds table_beginscan_tid() and calls it in TidListEval()
> > instead of table_beginscan().
> > table_beginscan_tid() is the same as table_beginscan() but does not set
> > SO_TYPE_SEQSCAN in flags.
> > Although I'm not sure whether this behavior is really a problem or not,
> > it seems to me that the previous behavior is preferable.
> > Is it worth applying to HEAD and the v12 branch?
> 
> I've not read the patch yet, but I agree that updating only seq_scan
> but not seq_tup_read in Tid Scan sounds strange. IMO at least
> both should be updated together or neither should be updated.

Basically agreed, but sample scan doesn't increment seq_scan but
increments seq_tup_read.

Aside from that fact, before 147e3722f7 TidScan didn't need a scan
descriptor, so it didn't call table_beginscan. table_beginscan didn't
increment the counter for bitmapscan and samplescan. The commit
changed TidScan to call table_beginscan but didn't change table_beginscan
to avoid incrementing the counter for tidscans.

From the viewpoint of the pg_stat_*_tables views, the counters move as follows.

increments
scan type  : table_beginscan?, per scan, per tuple, SO_TYPE flags
=
seq scan   : yes , seq_scan, seq_tup_read  , SO_TYPE_SEQSCAN
index scan : no  , idx_scan, idx_tup_fetch , 
bitmap scan: yes , idx_scan, idx_tup_fetch , SO_TYPE_BITMAPSCAN
sample scan: yes ,   , seq_tup_read  , SO_TYPE_SAMPLESCAN
TID scan   : yes , seq_scan, , 

bitmap scan and sample scan were historically excluded by the corresponding
flags is_bitmapscan and is_samplescan, and commit c3b23ae457 moved
that work to the SO_TYPE_* flags.  After 147e3722f7, TID scan has the same
characteristics, that is, it calls table_beginscan but should not
increment seq_scan.  But it doesn't have a corresponding flag value.

I'd rather think that whatever calls table_beginscan should have
corresponding SO_TYPE_* flags. (Note: index scan doesn't call it.)


It would be another issue what we should do for the sample scan case.
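
For reference, the behavior in question can be observed with something like
this (the table name is arbitrary, and the statistics views are updated
asynchronously, so a short delay may be needed before the counters move):

CREATE TABLE tidtest AS SELECT 1 AS a;
SELECT * FROM tidtest WHERE ctid = '(0,1)';            -- planned as a TID scan
SELECT seq_scan, seq_tup_read
  FROM pg_stat_all_tables WHERE relname = 'tidtest';   -- seq_scan is bumped, seq_tup_read is not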

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Physical replication slot advance is not persistent

2020-01-29 Thread Michael Paquier
On Wed, Jan 29, 2020 at 05:10:20PM +0900, Kyotaro Horiguchi wrote:
> Looks perfect.

Thanks Horiguchi-san and others.  Applied and back-patched down to
11.
--
Michael


signature.asc
Description: PGP signature


Re: standby apply lag on inactive servers

2020-01-29 Thread Kyotaro Horiguchi
At Wed, 29 Jan 2020 19:11:31 -0300, Alvaro Herrera  
wrote in 
> On 2020-Jan-28, Kyotaro Horiguchi wrote:
> 
> > The aim of the patch seems reasonable. XLOG_END_OF_RECOVERY and
> > XLOG_XACT_PREPARE also have a timestamp but it doesn't help much. (But
> > could be included for consistency.)
> 
> Hmm I think I should definitely include those.

I agree with that, given the following change of log messages.

> > The timestamp of a checkpoint record is the start time of a checkpoint
> > (and doesn't have a subseconds part, but that's not a problem). That
> > means that the timestamp is usually far behind the time at which the record
> > was inserted. As a result the stored timestamp can go back by a
> > significant interval. I don't think that causes an actual problem but
> > the movement looks weird as the result of
> > pg_last_xact_replay_timestamp().
> 
> A problem I see with this is that setting the timestamp is done with
> XLogCtl->lock held; since now we need to make the update conditional,
> we'd have to read the current value, compare with the checkpoint time,
> then set, all with the spinlock held.  That seems way too expensive.
> 
> A compromise might be to do the compare only when it's done for
> checkpoint.  These occur seldom enough that it shouldn't be a problem
> (as opposed to commit records, which can be very frequent).

I think we don't need to do that, given the following change.

> > Aside from the backward movement, a timestamp from other than
> > commit/abort records in recoveryLastXTime affects the following code.
> > 
> > xlog.c:7329: (similar code exists at line 9332)
> > >ereport(LOG,
> > >   (errmsg("redo done at %X/%X",
> > >   (uint32) (ReadRecPtr >> 32), (uint32) 
> > > ReadRecPtr)));
> > >xtime = GetLatestXTime();
> > >if (xtime)
> > >   ereport(LOG,
> > >   (errmsg("last completed transaction was at log 
> > > time %s",
> > >   timestamptz_to_str(xtime;
> > 
> > This code assumes (and the name GetLatestXTime() suggests; I first
> > noticed that here) that the timestamp comes from commit/abort records,
> > so otherwise it shows a wrong timestamp.  We shouldn't update the
> > variable from other kinds of records.
> 
> Thinking about this some more, I think we should keep the message the
> same in the backbranches (avoid breaking anything that might be reading the log
> -- a remote but not nonexistent possibility), and adjust the message in
> master to be something like "last timestamped WAL activity at time %s",
> and document that it means commit, abort, restore label, or checkpoint.

Agreed about the backbranches. I'd like to preserve the word "transaction"
as it is more familiar to users. How about something like the following?

"transactions are completed up to log time %s"

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: parens cleanup

2020-01-29 Thread Michael Paquier
On Wed, Jan 29, 2020 at 04:47:19PM -0500, Tom Lane wrote:
> Alvaro Herrera  writes:
>> 0002 removes newlines immediately following parens.  These were
>> previously useful because pgindent would move arguments further to the
>> left so that the line would fit under 80 chars.  However, pgindent no
>> longer does that, so the newlines are pointless and ugly.
> 
> +1 except for the changes in zic.c.  Those line breaks are following
> the upstream code, so I'd just put them back in the next merge ...

+1.

>> These being cosmetic cleanups, they're not intended for backpatch,
>> though an argument could be made that doing that would save some future
>> backpatching pain.  If there are sufficient votes for that, I'm open to
>> doing it.  (Of course, 0002 would not be backpatched further back than
>> pg10, the first release that uses the "new" pgindent rules.)
> 
> Meh, -0.1 or so on back-patching.

I am not sure that this is worth a back-patch.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Block level parallel vacuum

2020-01-29 Thread Amit Kapila
On Wed, Jan 29, 2020 at 7:20 PM Masahiko Sawada
 wrote:
>
> >
> > Okay, thanks for the review.  Attached is an updated patch. I have
> > additionally run pgindent.  I am planning to commit the attached
> > tomorrow unless I see more comments.
>
> Thank you for committing it!
>

I have marked this patch as committed in CF.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Do we need to handle orphaned prepared transactions in the server?

2020-01-29 Thread Craig Ringer
On Thu, 30 Jan 2020 at 02:04, Hamid Akhtar  wrote:
>
> So having seen the feedback on this thread, and I tend to agree with most of 
> what has been said here, I also agree that the server core isn't really the 
> ideal place to handle the orphan prepared transactions.
>
> Ideally, these must be handled by a transaction manager, however, I do 
> believe that we cannot let database suffer for failing of an external 
> software, and we did a similar change through introduction of idle in 
> transaction timeout behavior.

The difference, IMO, is that idle-in-transaction aborts don't affect
anything we've promised to be durable.

Once you PREPARE TRANSACTION the DB has made a promise that that txn
is durable. We don't have any consistent feedback channel back to
applications to say "Hey, if you're not going to finish this up we
need to get rid of it soon, ok?". If a distributed transaction manager
gets consensus for commit and goes to COMMIT PREPARED a previously
prepared txn only to find that it has vanished, that's a major
problem, and one that may bring the entire DTM to a halt until the
admin can intervene.

This isn't like idle-in-transaction aborts. It's closer to something
like uncommitting a previously committed transaction.

I do think it'd make sense to ensure that the documentation clearly
highlights the impact of abandoned prepared xacts on server resource
retention and performance, preferably with pointers to appropriate
views. I haven't reviewed the docs to see how clear that is already.

I can also see an argument for a periodic log message (maybe from
vacuum?) warning when old prepared xacts hold xmin down. Including one
sent to the client application when an explicit VACUUM is executed.
(In fact, it'd make sense to generalise that for all xmin-retention).

But I'm really not a fan of aborting such txns. If you operate with
some kind of broken global transaction manager that can forget or
abandon prepared xacts, then fix it, or adopt site-local periodic
cleanup tasks that understand your site's needs.
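
As a sketch of what such a site-local check might look like (the age
threshold here is purely illustrative):

SELECT gid, prepared, owner, database
  FROM pg_prepared_xacts
 WHERE prepared < now() - interval '1 day'
 ORDER BY prepared;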

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise




Re: closesocket behavior in different platforms

2020-01-29 Thread Amit Kapila
On Wed, Jan 29, 2020 at 8:29 PM Robert Haas  wrote:
>
> On Wed, Jan 29, 2020 at 6:04 AM vignesh C  wrote:
> > Thanks for your review and suggestion. I have made a patch based on
> > similar lines. Attached patch has the doc update with the explanation.
> > Thoughts?
>
> Documenting this doesn't seem very useful to me.
>

I thought of documenting it because this has been reported/discussed
multiple times (see some of the links of discussions at the end of the
first email) and every time we need to spend time explaining the same
thing.  However, if we decide not to do that I am fine with it.

> If we could fix the
> code, that would be useful, but otherwise I think I'd just do nothing.
>

Yeah, that is our first choice as well, but there doesn't seem to be a
good solution to it as this is a platform-specific behavior.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: pg_stat_progress_basebackup - progress reporting for pg_basebackup, in the server side

2020-01-29 Thread Kyotaro Horiguchi
At Wed, 29 Jan 2020 23:16:08 +0900, Fujii Masao  
wrote in 
> Hi,
> 
> pg_basebackup reports the backup progress if the --progress option is
> specified, and we can monitor it on the client side. I think that it's
> useful if we can monitor the progress information also on the server
> side because, for example, we can easily check that by using SQL. So
> I'd like to propose a pg_stat_progress_basebackup view that allows us
> to monitor the progress of pg_basebackup on the server side. Thoughts?
> 
> POC patch is attached.

| postgres=# \d pg_stat_progress_basebackup
|  View "pg_catalog.pg_stat_progress_basebackup"
|Column|  Type   | Collation | Nullable | Default 
| -+-+---+--+-
|  pid | integer |   |  | 
|  phase   | text|   |  | 
|  backup_total| bigint  |   |  | 
|  backup_streamed | bigint  |   |  | 
|  tablespace_total| bigint  |   |  | 
|  tablespace_streamed | bigint  |   |  | 

I think the view needs a client identity such as a host/port pair
besides the pid (a host/port pair fails to identify the client in the case of
unix sockets).  Also, the elapsed time from session start might be
useful. pg_stat_progress_vacuum has datid, datname and relid.
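
For illustration, assuming the POC columns listed above, a monitoring query
could look like this (the percentage is only meaningful once backup_total has
been set):

SELECT pid, phase,
       round(100.0 * backup_streamed / nullif(backup_total, 0), 1) AS backup_pct,
       tablespace_streamed, tablespace_total
  FROM pg_stat_progress_basebackup;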

+   if (backup_total > 0 && backup_streamed > backup_total)
+   {
+   backup_total = backup_streamed;

Do we need the condition "backup_total > 0"?


+   int tblspc_streamed = 0;
+
+   pgstat_progress_update_param(PROGRESS_BASEBACKUP_PHASE,
+
PROGRESS_BASEBACKUP_PHASE_STREAM_BACKUP);

This starts "streaming backup" phase with backup_total = 0. Coudln't
we move to that phase after setting backup total and tablespace total?
That is, just after calling SendBackupHeader().

+WHEN 3 THEN 'stopping backup'::text

I'm not sure, but the "stop" seems to suggest that the backup is terminated
before completion. If it is following the name of the function
pg_stop_backup, I think that name suggests stopping "the state for
performing a backup", not stopping the backup itself.

In the first place, the progress is about "backup", so it seems strange
that we have another phase after the "stopping backup" phase.  It
might be better to call it "finishing file transfer" or some such.

   "initializing"
-> "starting file transfer"
-> "transferring files"
-> "finishing file transfer"
-> "transaferring WAL"

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pause recovery if pitr target not reached

2020-01-29 Thread Kyotaro Horiguchi
At Wed, 29 Jan 2020 16:01:46 +0100, Peter Eisentraut 
 wrote in 
> On 2020-01-28 06:01, Kyotaro Horiguchi wrote:
> > The code looks fine, renaming reachedStopPoint to
> > reachedRecoveryTarget looks very nice. Doc part looks fine, too.
> > PostgresNode.pm
> > +Is has_restoring is used, standby mode is used by default.  To use
> > "Is has_restoring used,", or "If has_restoring is used," ?
> 
> Committed with that fix.

Thanks.

> > The change seems aiming not to break compatibility with external test
> > scripts, but it looks quite confusing to me.  The problem here is both
> > enable_streaming/restoring independently put trigger files, so don't
> > we separate the placing of trigger files out of the functions?
> 
> Yeah, this is all historically grown, but a major refactoring seems
> out of scope for this thread.  It seems hard to come up with a more
> elegant way, since after all the underlying mechanisms are also all
> intertwined.  Your patch adds even more code, so I'm not sure it's an
> improvement.

Yeah, as I wrote, I thought that was too much, but I think at least a POD
section for the internal-but-externally-used routines would be
needed. Don't we need that?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: pg_restore crash when there is a failure before all child process is created

2020-01-29 Thread vignesh C
On Wed, Jan 29, 2020 at 6:54 PM Ahsan Hadi  wrote:
>
> Hi Vignesh,
>
> Can you share a test case or steps that you are using to reproduce this 
> issue? Are you reproducing this using a debugger?
>

I could reproduce it with the following steps:
Set up a cluster.
Create a few tables.
Take a dump in directory format using pg_dump.
Restore the dump generated above using pg_restore with a very high
number for the --jobs option, around 600.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Option to dump foreign data in pg_dump

2020-01-29 Thread vignesh C
On Wed, Jan 29, 2020 at 2:00 PM Luis Carril  wrote:
>
> Thanks for working on the comments. I noticed one behavior is
> different when --table option is specified. When --table is specified
> the following are not getting dumped:
> CREATE SERVER foreign_server
>
> I felt the above also should be included as part of the dump when
> include-foreign-data option is specified.
>
> Yes, it also happens on master. A dump of a foreign table using --table, 
> which only dumps the table definition, does not include the extension nor the 
> server.
> I guess that the idea behind --table is that the table prerequisites should 
> already exist on the database.
>
> A similar behavior can be reproduced for a non foreign table. If a table is 
> created in a specific schema, dumping only the table with --table does not 
> dump the schema definition.
>
> So I think we do not need to dump the server with the table.
>

Thanks for the clarification, the behavior sounds reasonable to me
unless others have a different opinion on this.
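
For reference, the kind of objects being discussed look something like this
(all names are hypothetical):

CREATE EXTENSION postgres_fdw;
CREATE SERVER foreign_server FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'remote_db');
CREATE FOREIGN TABLE foreign_films (id int, title text) SERVER foreign_server;
-- "pg_dump --table=foreign_films" emits only the foreign table definition,
-- not the extension or the server above.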

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-01-29 Thread Kasahara Tatsuhito
Hi,

On Thu, Jan 30, 2020 at 10:55 AM Kyotaro Horiguchi
 wrote:
> At Wed, 29 Jan 2020 23:24:09 +0900, Fujii Masao  
> wrote in
> > On 2020/01/29 20:06, Kasahara Tatsuhito wrote:
> > > Although I'm not sure whether this behavior is really a problem or not,
> > > it seems to me that the previous behavior is preferable.
> > > Is it worth applying to HEAD and the v12 branch?
> >
> > I've not read the patch yet, but I agree that updating only seq_scan
> > but not seq_tup_read in Tid Scan sounds strange. IMO at least
> > both should be updated together or neither should be updated.
>
> Basically agreed, but sample scan doesn't increment seq_scan but
> increments seq_tup_read.
Yeah, sample scan's behavior is also a bit annoying.


> From the viewpoint of the pg_stat_*_tables views, the counters move as follows.
Thanks for your clarification.

> TID scan   : yes , seq_scan, , 
This is wrong, because TID scan came to have the SO_TYPE_SEQSCAN flag
as of commit 147e3722f7.

So, currently (v12 and HEAD) the TID scan status is as follows:

 increments
scan type  : table_beginscan?, per scan, per tuple, SO_TYPE flags
=
TID scan   : yes , seq_scan, , SO_TYPE_SEQSCAN

And my patch changes the status to the following (same as v11 and before):

 increments
scan type  : table_beginscan?, per scan, per tuple, SO_TYPE flags
=
TID scan   : yes , , , 


> I'd rather think that whatever calls table_beginscan should have
> corresponding SO_TYPE_* flags. (Note: index scan doesn't call it.)
Agreed.
It may be better to add a new flag such as SO_TYPE_TIDSCAN,
and handle the statistics updates and other things based on it.
But that may be a bit of overkill, since I want to revert to the previous
behavior this time.

Best regards,
-- 
Tatsuhito Kasahara
kasahara.tatsuhito _at_ gmail.com




Re: Tid scan increments value of pg_stat_all_tables.seq_scan. (but not seq_tup_read)

2020-01-29 Thread Kyotaro Horiguchi
At Thu, 30 Jan 2020 13:30:56 +0900, Kasahara Tatsuhito 
 wrote in 
> > TID scan   : yes , seq_scan, , 
> This is wrong, because TID scan came to have the SO_TYPE_SEQSCAN flag
> as of commit 147e3722f7.

It is reflecting the discussion below, which means TID scan doesn't
have a corresponding SO_TYPE_* value. Currently it is setting
SO_TYPE_SEQSCAN by accident.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: BufFileRead() error signalling

2020-01-29 Thread Michael Paquier
On Wed, Jan 29, 2020 at 10:01:31AM -0500, Robert Haas wrote:
> Your grep misses one instance of 'read only %d of %d bytes' because
> you grepped for %zu specifically, but that doesn't really change the
> overall picture.

Yes, the one in pg_checksums.c.  That could actually be changed with a
cast to Size.  (Note that there is a second one related to writes but
there is a precedent in md.c, and a similar one in rewriteheap.c.)

Sorry for the digression.
--
Michael


signature.asc
Description: PGP signature


Prevent pg_basebackup running as root

2020-01-29 Thread Ian Barwick
Hi

We encountered an unfortunate case of $SUBJECT the other day where it
would have been preferable to catch the error before rather than after
pg_basebackup ran.

I can't think of any practical reason why pg_basebackup would ever need to
be run as root; we disallow that for initdb, pg_ctl and pg_upgrade, so it
seems reasonable to do the same for pg_basebackup. Trivial patch attached,
which as with the other cases will allow only the --help/--version options
to be executed as root, otherwise nothing else.

The patch doesn't update the pg_basebackup documentation page; we don't
mention it in the pg_ctl and pg_upgrade pages either and it doesn't seem
particularly important to mention it explicitly.

I'll add this to the March CF.


Regards

Ian Barwick


-- 
  Ian Barwick   https://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 238b671f7a..f25a137114 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2077,6 +2077,21 @@ main(int argc, char **argv)
 		}
 	}
 
+	/*
+	 * Disallow running as root, as PostgreSQL will be unable to start
+	 * with root-owned files.
+	 */
+	#ifndef WIN32
+	if (geteuid() == 0)			/* 0 is root's uid */
+	{
+		pg_log_error("cannot be run as root");
+		fprintf(stderr,
+_("Please log in (using, e.g., \"su\") as the (unprivileged) user that will\n"
+  "own the server process.\n"));
+		exit(1);
+	}
+#endif
+
 	atexit(cleanup_directories_atexit);
 
 	while ((c = getopt_long(argc, argv, "CD:F:r:RS:T:X:l:nNzZ:d:c:h:p:U:s:wWkvP",


Re: Prevent pg_basebackup running as root

2020-01-29 Thread Michael Paquier
On Thu, Jan 30, 2020 at 02:29:06PM +0900, Ian Barwick wrote:
> I can't think of any practical reason why pg_basebackup would ever need to
> be run as root; we disallow that for initdb, pg_ctl and pg_upgrade, so it
> seems reasonable to do the same for pg_basebackup. Trivial patch attached,
> which as with the other cases will allow only the --help/--version options
> to be executed as root, otherwise nothing else.

My take on the matter is that we should prevent anything creating or
modifying the data directory to run as root if we finish with
permissions incompatible with what a postmaster expects.  So +1.

> The patch doesn't update the pg_basebackup documentation page; we don't
> mention it in the pg_ctl and pg_upgrade pages either and it doesn't seem
> particularly important to mention it explicitly.

We don't mention that in the docs of pg_rewind either.  Note also that
before 5d5aedd pg_rewind printed an error without exiting :)

> + /*
> +  * Disallow running as root, as PostgreSQL will be unable to start
> +  * with root-owned files.
> +  */

Here is a suggestion:
/*
 * Don't allow pg_basebackup to be run as root, to avoid creating
 * files in the data directory with ownership rights incompatible
 * with the postmaster.  We need only check for root -- any other user
 * won't have sufficient permissions to modify files in the data
 * directory.
 */ 

> + #ifndef WIN32

Indentation here.

> + if (geteuid() == 0) /* 0 is root's uid */
> + {
> + pg_log_error("cannot be run as root");
> + fprintf(stderr,
> + _("Please log in (using, e.g., \"su\") as the 
> (unprivileged) user that will\n"
> +   "own the server process.\n"));
> + exit(1);
> + }
> +#endif

I would recommend to map with the existing message of pg_rewind for
consistency:
  pg_log_error("cannot be executed by \"root\"");
  fprintf(stderr, _("You must run %s as the PostgreSQL superuser.\n"),
  progname); 

A backpatch could be surprising for some users as that's a behavior
change, so I would recommend not to do a backpatch.
--
Michael


signature.asc
Description: PGP signature


Re: Error message inconsistency

2020-01-29 Thread Mahendra Singh Thalor
On Tue, 28 Jan 2020 at 18:13, Amit Kapila  wrote:
>
> On Sat, Jan 25, 2020 at 10:16 AM Amit Kapila  wrote:
> >
> > On Fri, Jan 24, 2020 at 9:37 PM Tom Lane  wrote:
> > >
> > > Amit Kapila  writes:
> > > > LGTM.  I have combined them into the single patch.  What do we think
> > > > about backpatching this?
> > >
> > > No objection to the patch for HEAD, but it does not seem like
> > > back-patch material: it is not fixing a bug.
> > >
> >
> > Okay, I will commit this early next week (by Tuesday).
> >
>
> Pushed.
>

Thank you for committing it!

-- 
Thanks and Regards
Mahendra Singh Thalor
EnterpriseDB: http://www.enterprisedb.com




Re: making the backend's json parser work in frontend code

2020-01-29 Thread Andrew Dunstan
On Thu, Jan 30, 2020 at 7:36 AM Mark Dilger
 wrote:
>
>
>
> > On Jan 29, 2020, at 1:02 PM, Andrew Dunstan 
> >  wrote:
> >
> > On Wed, Jan 29, 2020 at 4:32 PM Andrew Dunstan
> >  wrote:
> >>
> >>
> >> On 1/28/20 5:28 PM, Mark Dilger wrote:
> >>>
> >>>
>  +# There doesn't seem to be any easy way to get TestLib to use the 
>  binaries from
>  +# our directory, so we hack up a path to our binary and run that
>  directly.  This
>  +# seems brittle enough that some other solution should be found, if 
>  possible.
>  +
>  +my $test_json = join('/', $ENV{TESTDIR}, 'test_json');
> 
>  I don't know what the right thing to do here is. Perhaps someone more
>  familiar with TAP testing can comment.
> >>> Yeah, I was hoping that might get a comment from Andrew.  I think if it 
> >>> works as-is on windows, we could just use it this way until it causes a 
> >>> problem on some platform or other.  It’s not a runtime issue, being only 
> >>> a build-time test, and only then when tap tests are enabled *and* running 
> >>> check-world, so nobody should really be adversely affected.  I’ll likely 
> >>> get around to testing this on Windows, but I don’t have any Windows 
> >>> environments set up yet, as that is still on my todo list.
> >>>
> >>
> >>
> >> I think using TESTDIR is Ok,
> >
> >
> > I've changed my mind, I don't think that will work for MSVC, the
> > executable gets built elsewhere for that. I'll try to come up with
> > something portable.
>
> I’m just now working on getting my Windows VMs set up with Visual Studio and 
> whatnot, per the wiki instructions, so I don’t need to burden you with this 
> sort of Windows task in the future.  If there are any gotchas not mentioned 
> on the wiki, I’d appreciate pointers about how to avoid them.  I’ll try to 
> help devise a solution, or test what you come up with, once I’m properly set 
> up for that.
>
> For no particular reason, I chose Windows Server 2019 and Windows 10 Pro.
>


One VM should be sufficient. Either W10 Pro or WS2019 would be fine. I
have buildfarm animals running on both.

Here's what I got working after a lot of trial and error. (This will
require a tiny change in the buildfarm script to make the animals test
it). Note that there is one test that I couldn't get working, so I
skipped it. If you can find out why it fails, so much the better ... it
seems to be related to how the command processor handles quotes.

cheers

andrew


-- 
Andrew Dunstan    https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/Makefile b/src/Makefile
index bcdbd9588a..ccd4bab0de 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -2,7 +2,8 @@
 #
 # Makefile for src
 #
-# Copyright (c) 1994, Regents of the University of California
+# Portions Copyright (c) 1994, Regents of the University of California
+# Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
 #
 # src/Makefile
 #
@@ -27,6 +28,7 @@ SUBDIRS = \
 	bin \
 	pl \
 	makefiles \
+	test/bin \
 	test/regress \
 	test/isolation \
 	test/perl
diff --git a/src/test/Makefile b/src/test/Makefile
index efb206aa75..6c3a1d4c27 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,7 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = perl regress isolation modules authentication recovery subscription
+SUBDIRS = bin perl regress isolation modules authentication recovery subscription
 
 # Test suites that are not safe by default but can be run if selected
 # by the user via the whitespace-separated list in variable
@@ -40,10 +40,10 @@ endif
 ALWAYS_SUBDIRS = $(filter-out $(SUBDIRS),examples kerberos ldap locale thread ssl)
 
 # We want to recurse to all subdirs for all standard targets, except that
-# installcheck and install should not recurse into the subdirectory "modules".
+# installcheck and install should not recurse into "modules" or "bin"
 
 recurse_alldirs_targets := $(filter-out installcheck install, $(standard_targets))
-installable_dirs := $(filter-out modules, $(SUBDIRS))
+installable_dirs := $(filter-out modules bin, $(SUBDIRS))
 
 $(call recurse,$(recurse_alldirs_targets))
 $(call recurse,installcheck, $(installable_dirs))
diff --git a/src/test/bin/.gitignore b/src/test/bin/.gitignore
new file mode 100644
index 00..6709c749d8
--- /dev/null
+++ b/src/test/bin/.gitignore
@@ -0,0 +1 @@
+test_json
diff --git a/src/test/bin/Makefile b/src/test/bin/Makefile
new file mode 100644
index 00..113ce04cba
--- /dev/null
+++ b/src/test/bin/Makefile
@@ -0,0 +1,43 @@
+#-
+#
+# Makefile for src/test/bin
+#
+# Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/bin/bin/Makefile
+#
+#-
+
+PGFILE

Re: Hash join not finding which collation to use for string hashing

2020-01-29 Thread Amit Langote
Hi Mark,

On Wed, Jan 29, 2020 at 1:03 PM Mark Dilger
 wrote:
> > On Jan 28, 2020, at 7:38 PM, Mark Dilger  
> > wrote:
> >> Mark Dilger  writes:
> >>> While reviewing the partition-wise join patch, I ran into an issue that 
> >>> exists in master, so rather than responding to that patch, I’m starting 
> >>> this new thread.
> >>> I noticed that this seems similar to the problem that was supposed to 
> >>> have been fixed in the "Re: COLLATE: Hash partition vs UPDATE” thread.  
> >>> As such, I’ve included Tom and Amit in the CC list.

Just to clarify, we only intended in the quoted thread to plug
relevant holes of the *partitioning* code, which IIRC was more
straightforward to do than appears to be the case here.

> If the answer here is just that you’d rather it always fail at planning time 
> because that’s more deterministic than having it sometimes succeed and 
> sometimes fail at runtime depending on which data has been loaded, ok, I can 
> understand that.  If so, then let’s put this error string into the docs, 
> because right now, if you google
>
> site:postgresql.org "could not determine which collation to use for 
> string hashing”
>
> you don’t get anything from the docs telling you that this is an expected 
> outcome.

You may have noticed that it's not only hash join that bails out:

EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2
ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
 QUERY PLAN

 Hash Join
   Hash Cond: ((t2.a)::text = (t1.a)::text)
   ->  Seq Scan on beta t2
   ->  Hash
 ->  Seq Scan on alpha t1
   Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
(6 rows)

SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR:  could not determine which collation to use for string hashing
HINT:  Use the COLLATE clause to set the collation explicitly.

SET enable_hashjoin TO off;
-- error occurs partway through ExecInitMergeJoin(), so EXPLAIN can't finish
EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2
ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR:  could not determine which collation to use for string comparison
HINT:  Use the COLLATE clause to set the collation explicitly.

SET enable_mergejoin TO off;
EXPLAIN (COSTS OFF) SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2
ON (t1.a = t2.a) WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
 QUERY PLAN

 Nested Loop
   Join Filter: ((t1.a)::text = (t2.a)::text)
   ->  Seq Scan on beta t2
   ->  Materialize
 ->  Seq Scan on alpha t1
   Filter: (a = ANY ('{äbç,ὀδυσσεύς}'::text[]))
(6 rows)

SELECT t1.a, t2.a FROM alpha t1 INNER JOIN beta t2 ON (t1.a = t2.a)
WHERE t1.a IN ('äbç', 'ὀδυσσεύς');
ERROR:  could not determine which collation to use for string comparison
HINT:  Use the COLLATE clause to set the collation explicitly.
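
(For reference, a pair of tables along these lines is enough to reproduce
the errors above; the exact definitions used in the earlier thread may
differ, and any two distinct column collations will do:)

CREATE TABLE alpha (a varchar(32) COLLATE "C");
CREATE TABLE beta  (a varchar(32) COLLATE "POSIX");
INSERT INTO alpha VALUES ('äbç');
INSERT INTO beta  VALUES ('äbç');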

With PG 11, I can see that both hash join and nestloop join work.  But with
PG 12, this join can't possibly work without an explicit COLLATE
clause.  So it would be nice if we could report a more specific error
much sooner, possibly with some parser context, given that we now know
for sure that a join qual without a collation assigned will not work
at all.  IOW, maybe we should aim to make the run-time collation
errors fall into the "won't happen" category as much as possible.

Tom said:
> >> Now, I'd be the first to agree that this error could be reported better.
> >> The parser knows that it couldn't resolve a collation for t1.a = t2.a, but
> >> what it does *not* know is whether the '=' operator cares for collation.
> >> Throwing an error when the operator wouldn't care at runtime isn't going
> >> to make many people happy.  On the other hand, when the operator finally
> >> does run and can't get a collation, all it knows is that it didn't get a
> >> collation, not why.  So we can't produce an error message as specific as
> >> "ja_JP and tr_TR collations conflict".
> >>
> >> Now that the collations feature has settled in, it'd be nice to go back
> >> and see if we can't improve that somehow.  Not sure how.

Would it make sense to catch a qual with an unassigned collation
somewhere in the planner, where the qual's operator family is
established, by checking whether the operator family's behavior is
sensitive to collations?

Thanks,
Amit




Re: Prevent pg_basebackup running as root

2020-01-29 Thread Ian Barwick
2020年1月30日(木) 14:57 Michael Paquier :
>
> On Thu, Jan 30, 2020 at 02:29:06PM +0900, Ian Barwick wrote:
> > I can't think of any practical reason why pg_basebackup would ever need to
> > be run as root; we disallow that for initdb, pg_ctl and pg_upgrade, so it
> > seems reasonable to do the same for pg_basebackup. Trivial patch attached,
> > which as with the other cases will allow only the --help/--version options
> > to be executed as root, otherwise nothing else.
>
> My take on the matter is that we should prevent anything creating or
> modifying the data directory to run as root if we finish with
> permissions incompatible with what a postmaster expects.  So +1.
>
> > The patch doesn't update the pg_basebackup documentation page; we don't
> > mention it in the pg_ctl and pg_upgrade pages either and it doesn't seem
> > particularly important to mention it explicitly.
>
> We don't mention that in the docs of pg_rewind either.  Note also that
> before 5d5aedd pg_rewind printed an error without exiting :)

Ouch.

> > + /*
> > +  * Disallow running as root, as PostgreSQL will be unable to start
> > +  * with root-owned files.
> > +  */
>
> Here is a suggestion:
> /*
>  * Don't allow pg_basebackup to be run as root, to avoid creating
>  * files in the data directory with ownership rights incompatible
>  * with the postmaster.  We need only check for root -- any other user
>  * won't have sufficient permissions to modify files in the data
>  * directory.
>  */

I think we can skip the second sentence altogether. It'd be theoretically
easy enough to come up with some combination of group permissions,
sticky bits, umask, ACL settings etc. which would allow one user to
modify the files owned by another user.

> > + #ifndef WIN32
>
> Indentation here.

Whoops, that's what comes from typing on the train ;)

> > + if (geteuid() == 0) /* 0 is root's uid */
> > + {
> > + pg_log_error("cannot be run as root");
> > + fprintf(stderr,
> > + _("Please log in (using, e.g., \"su\") as the 
> > (unprivileged) user that will\n"
> > +   "own the server process.\n"));
> > + exit(1);
> > + }
> > +#endif
>
> I would recommend to map with the existing message of pg_rewind for
> consistency:
>   pg_log_error("cannot be executed by \"root\"");
>   fprintf(stderr, _("You must run %s as the PostgreSQL superuser.\n"),
>   progname);

Hmm, I was using the existing message from initdb and pg_ctl for consistency:

src/bin/initdb/initdb.c:

if (geteuid() == 0)/* 0 is root's uid */
{
pg_log_error("cannot be run as root");
fprintf(stderr,
_("Please log in (using, e.g., \"su\") as the
(unprivileged) user that will\n"
  "own the server process.\n"));
exit(1);
}

src/bin/pg_ctl/pg_ctl.c:

if (geteuid() == 0)
{
write_stderr(_("%s: cannot be run as root\n"
   "Please log in (using, e.g., \"su\") as the "
   "(unprivileged) user that will\n"
   "own the server process.\n"),
 progname);
exit(1);
}

src/bin/pg_upgrade/option.c:

if (os_user_effective_id == 0)
pg_fatal("%s: cannot be run as root\n", os_info.progname);

I wonder if it would be worth settling on a common message and way of emitting
it; each utility does it slightly differently.

> A backpatch could be surprising for some users as that's a behavior
> change, so I would recommend not to do a backpatch.

Agreed.


Regards

Ian Barwick
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 238b671f7a..dfa816cbae 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2077,6 +2077,22 @@ main(int argc, char **argv)
 		}
 	}
 
+	/*
+	 * Don't allow pg_basebackup to be run as root, to avoid creating
+	 * files in the data directory with ownership rights incompatible
+	 * with the postmaster.
+	 */
+#ifndef WIN32
+	if (geteuid() == 0)			/* 0 is root's uid */
+	{
+		pg_log_error("cannot be run as root");
+		fprintf(stderr,
+_("Please log in (using, e.g., \"su\") as the (unprivileged) user that will\n"
+  "own the server process.\n"));
+		exit(1);
+	}
+#endif
+
 	atexit(cleanup_directories_atexit);
 
 	while ((c = getopt_long(argc, argv, "CD:F:r:RS:T:X:l:nNzZ:d:c:h:p:U:s:wWkvP",


Re: Enabling B-Tree deduplication by default

2020-01-29 Thread Peter Geoghegan
On Wed, Jan 29, 2020 at 11:50 AM Peter Geoghegan  wrote:
> I should stop talking about it for now, and go back to reassessing the
> extent of the regression in highly unsympathetic cases. The patch has
> become faster in a couple of different ways since I last looked at
> this question, and it's entirely possible that the regression is even
> smaller than it was before.

I revisited the insert benchmark as a way of assessing the extent of
the regression from deduplication with a very unsympathetic case.
Background:

https://smalldatum.blogspot.com/2017/06/insert-benchmark-in-memory-intel-nuc.html

https://github.com/mdcallag/mytools/blob/master/bench/ibench/iibench.py

This workload consists of serial inserts into a table with a primary
key, plus three additional non-unique indexes. A low concurrency
benchmark seemed more likely to be regressed by the patch, so that's
what I focussed on. The indexes used have very few duplicates, and so
don't benefit from deduplication at all, with the exception of the
purchases_index_marketsegment index, which is a bit smaller (see log
files for precise details). The table (which is confusingly named
"purchases_index") had a total of three non-unique indexes, plus a
standard serial primary key. We insert 100,000,000 rows in total,
which takes under 30 minutes in each case. There are no reads, and no
updates or deletes.

There is a regression that is just shy of 2% here, as measured in
insert benchmark rows/sec -- this metric goes from 62190.0 rows/sec
on master to 60986.2 rows/sec with the patch. I think that
this is an acceptable price to pay for the benefits -- this is a small
regression for a particularly unfavorable case. Also, I suspect that
this result is still quite a bit better than what you'd get with
either InnoDB or MyRocks on the same hardware (these systems were the
original targets of the insert benchmark, which was only recently
ported over to Postgres). At least, Mark Callaghan reports getting
only about 40k rows/sec inserted in 2017 with roughly comparable
hardware and test conditions (we're both running with
synchronous_commit=off, or the equivalent). We're paying a small cost
in an area where Postgres can afford to take a hit, in order to gain a
much larger benefit in an area where Postgres is much less
competitive.

I attach detailed output from runs for both master and patch.

The shell script that I used to run the benchmark is as follows:

#!/bin/sh
psql -c "create database test;"

cd $HOME/code/mytools/bench/ibench
python2 iibench.py --dbms=postgres --setup | tee iibench-output.log
python2 iibench.py --dbms=postgres --max_rows=100000000 | tee -a iibench-output.log
psql -d test -c "SELECT pg_relation_size(oid),
pg_size_pretty(pg_relation_size(oid)),
relname FROM pg_class WHERE relnamespace = 'public'::regnamespace
ORDER BY 1 DESC LIMIT 15;" | tee -a iibench-output.log

-- 
Peter Geoghegan


master-iibench-output.log
Description: Binary data


patch-iibench-output.log
Description: Binary data


Re: Prevent pg_basebackup running as root

2020-01-29 Thread Michael Paquier
On Thu, Jan 30, 2020 at 03:38:54PM +0900, Ian Barwick wrote:
> 2020年1月30日(木) 14:57 Michael Paquier :

I have never noticed that your client was configured in Japanese :)

> I think we can skip the second sentence altogether. It'd be theoretically
> easy enough to come up with some combination of group permissions,
> sticky bits, umask, ACL settings etc. which would allow one user to
> modify the files owned by another user.

Okay, fine by me.

> Hmm, I was using the existing message from initdb and pg_ctl for consistency:

Ahh, indeed.  pg_rewind has inherited its message from pg_resetwal.

> I wonder if it would be worth settling on a common message and way of emitting
> it, each utility does it slightly differently.

Not sure that's a good idea.  Each tool has its own properties, so it
is good to keep some flexibility in the error message produced.

Anyway, your patch looks like a good idea to me, so let's see if
others have opinions or objections about it.
--
Michael


signature.asc
Description: PGP signature


Parallel CREATE INDEX vs DSM starvation

2020-01-29 Thread Thomas Munro
Hello,

CreateParallelContext() can return a context with seg == NULL.  That
causes CREATE INDEX to segfault.  Instead, it should fall back to
non-parallel build.  See attached.

This probably explains a segfault reported over on pgsql-general[1].

[1] 
https://www.postgresql.org/message-id/CA%2BhUKG%2BcfjHy63hXEOc-CRZEPcUhu9%3DP3gKk_W9OiXzj-dfV_g%40mail.gmail.com


0001-Handle-lack-of-DSM-segment-slots-in-parallel-btree-b.patch
Description: Binary data


Re: Parallel CREATE INDEX vs DSM starvation

2020-01-29 Thread Peter Geoghegan
On Wed, Jan 29, 2020 at 11:38 PM Thomas Munro  wrote:
> CreateParallelContext() can return a context with seg == NULL.  That
> causes CREATE INDEX to segfault.  Instead, it should fall back to
> non-parallel build.  See attached.

I guess we can't call _bt_end_parallel() here. So your patch LGTM.

Thanks
-- 
Peter Geoghegan