Re: Proposal to suppress errors thrown by to_reg*()

2019-03-19 Thread Kyotaro HORIGUCHI
Hello.

At Thu, 14 Mar 2019 13:37:00 +0900, Takuma Hoshiai  wrote 
in <20190314133700.c271429ddc00ddab3aac2...@sraoss.co.jp>
> Hi, hackers,
> 
> According to the document, "to_reg* functions return null rather than
> throwing an error if the name is not found", but this is not the case
> if the arguments to those functions are schema qualified and the
> caller does not have access permission of the schema even if the table
> (or other object) does exist -- we get an error.

You explicitly specified the namespace, so I don't think this is a
not-found case. It is right that the error happens, since you
explicitly tried to access an unprivileged schema.

> For example, to_regclass() throws an error if its argument is
> 'schema_name.table_name' (i.e. contains a schema name) and the caller's
> role doesn't have access permission on the schema. The same can be
> said of the other to_reg*() functions.
> 
> I got complaints from Pgpool-II users because it uses to_regclass()
> internally to confirm a table's existence, but in the case above it is
> not useful because the error aborts the user's transaction.

I'm not sure how such inaccessible table names get passed to the
function there, but it is also natural that any user trying to
access unprivileged objects gets an error.

> To be more consistent with the docs and to make those functions more
> useful, I propose to change the current implementation so that they do not
> throw an error if the namespace is not accessible by the caller.

Since this doesn't seem to be a bug, I think that changing the existing
behavior is not acceptable. Maybe we can live with another
signature of the function, like to_regproc(name text, noerror
bool).
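
For illustration, such a variant might be called like this (hypothetical;
no such two-argument form exists today):

  SELECT to_regproc('s1.f1', true);   -- noerror = true
  SELECT to_regclass('s1.t1', true);  -- would return NULL instead of raising
                                      -- an error for an inaccessible schema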

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: Problem with default partition pruning

2019-03-19 Thread Yuzuko Hosoya
Hi Amit-san,

From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
Sent: Monday, March 18, 2019 6:44 PM
 
> Hosoya-san,
> 
> On 2019/03/15 15:05, Yuzuko Hosoya wrote:
> > Indeed, it's problematic.  I also did a test and found that this
> > problem occurred when no partition matched the WHERE clauses.
> > So the following query didn't work correctly.
> >
> > # explain select * from test1_3 where (id > 0 and id < 30);
> >QUERY PLAN
> > -
> >  Append  (cost=0.00..58.16 rows=12 width=36)
> >->  Seq Scan on test1_3_1  (cost=0.00..29.05 rows=6 width=36)
> >  Filter: ((id > 0) AND (id < 30))
> >->  Seq Scan on test1_3_2  (cost=0.00..29.05 rows=6 width=36)
> >  Filter: ((id > 0) AND (id < 30))
> > (5 rows)
> >
> > I created a new patch to handle this problem, and confirmed the query
> > you mentioned works as expected.
> >
> > # explain select * from test1 where (id > 0 and id < 30) or (id > 220 and id < 230);
> >                             QUERY PLAN
> > ------------------------------------------------------------------
> >  Append  (cost=0.00..70.93 rows=26 width=36)
> >    ->  Seq Scan on test1_1_1  (cost=0.00..35.40 rows=13 width=36)
> >          Filter: (((id > 0) AND (id < 30)) OR ((id > 220) AND (id < 230)))
> >    ->  Seq Scan on test1_3_1  (cost=0.00..35.40 rows=13 width=36)
> >          Filter: (((id > 0) AND (id < 30)) OR ((id > 220) AND (id < 230)))
> > (5 rows)
> >
> > v2 patch attached.
> > Could you please check it again?
> 
> I think the updated patch breaks the promise that get_matching_range_bounds 
> won't set scan_default
> based on individual pruning value comparisons.  How about the attached delta 
> patch that applies on
> top of your earlier v1 patch, which fixes the issue reported by Thibaut?
> 
Indeed.  I agree with your proposal.
Also, I confirmed that your patch works correctly.

Best regards,
Yuzuko Hosoya





Re: [PATCH] src/test/modules/dummy_index -- way to test reloptions from inside of access method

2019-03-19 Thread Michael Paquier
On Mon, Mar 18, 2019 at 10:41:13PM +0300, Nikolay Shaplov wrote:
> So I created src/test/modules/dummy_index. It does no real indexing, but it
> has all types of reloptions that can be set (reloption_int, reloption_real,
> reloption_bool, reloption_string and reloption_string2). It also has a set of
> boolean GUC variables that enable test output concerning a certain reloption
> (do_test_reloption_int, do_test_reloption_real, do_test_reloption_bool,
> do_test_reloption_string and do_test_reloption_string2); you also set
> do_test_reloptions to true to get any output at all.
> The dummy index will print this output when an index is created and when a
> record is inserted (this is needed to check that your ALTER TABLE worked).
> Then you just use normal regression tests: turn on the test output, set some
> reloption, and check in the test output that it properly reaches the access
> method internals.

Thanks for making the effort to split that stuff.  This looks like an
interesting base template for anybody willing to look after some
basics with index AMs, like what's done for FDWs with blackhole_fdw.
Perhaps the name should be dummy_am_index or dummy_index_am?
dummy_index does not sound bad either.

> While writing this module I kept in mind the idea that this module can also
> be used for other AM-related tests, so I separated the code into two parts:
> dummy_index.c has only code related to the implementation of an empty access
> method, and all code related to the reloptions tests went into
> direloptions.c. So in the future somebody can add di[what_ever_he_wants].c with
> their own test code, add the necessary calls to dummy_index.c, create some GUC
> variables, and have their own feature tested.

Here are some comments.  I think that this could be simplified
further.

The README file could have a more consistent format with the rest.
See for example dummy_seclabel/README.  You could add a small
example with its usage.
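
For instance, the README could show something along these lines (just a
sketch: it assumes the module is packaged as an extension providing an
access method named dummy_index, and reuses the reloption and GUC names
mentioned above; the exact syntax may differ):

  CREATE EXTENSION dummy_index;
  CREATE TABLE t (i int);
  SET do_test_reloptions = true;
  SET do_test_reloption_int = true;
  CREATE INDEX t_idx ON t USING dummy_index (i) WITH (reloption_int = 42);
  ALTER INDEX t_idx SET (reloption_int = 7);
  INSERT INTO t VALUES (1);  -- also emits the test WARNING output on insert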

Is there any point in having string_option2?  String reloptions are
already tested with string_option.  Also => s/Seconf/Second/.

s/valudate/validate/.

+-- Test behavior of second string option (there can be issues with second one)
What are those issues?

+   } else
+   {
Code format does not follow the Postgres guidelines.  You could fix
all that with an indent run.

The ranges of the different values are not tested; wouldn't it be
better to test that as well?

The way the test is configured, with the strong dependencies between
the reloption types and the GUCs, is quite bug-prone I think.  All of
that is present only to print a couple of WARNING messages with
specific values.  So, why not remove the GUCs and the printing logic,
which only shows a subset of the values?  Please note that these
are visible directly via pg_class.reloptions.  So we could shave off
quite some code.

Please note that compilation of the module fails.
nodes/relation.h should maybe be access/relation.h?  You may want to review
all that.
--
Michael




Re: pg_basebackup ignores the existing data directory permissions

2019-03-19 Thread Haribabu Kommi
On Tue, Mar 19, 2019 at 5:29 PM Michael Paquier  wrote:

> On Mon, Mar 18, 2019 at 11:45:05AM -0400, Robert Haas wrote:
> > So you want to default to no group access regardless of the directory
> > permissions, with an option to enable group access that must be
> > explicitly specified?  That seems like a reasonable option to me; note
> > that initdb does seem to chdir() an existing directory.
>
> Hm.  We have been assuming that the contents of a base backup inherit
> the permission of the source when using pg_basebackup because this
> allows users to keep nodes in a consistent state without deciding
> which option to use.  Do you mean that you would like to enforce the
> permissions of only the root directory if it exists?  Or the root
> directory with all its contents?  The former may be fine.  The latter
> is definitely not.
>

As per my understanding from the discussion, the option is for the
root directory along with all its contents.

How about the following change?

pg_basebackup  --> copies the contents of the source directory (with group
access) and also the root directory permissions.

pg_basebackup --no-group-access  --> copies the contents of the source
directory (with no group access), even for the root directory.

So the default behavior works for many people, and others that need more
restrictive behavior can use the new option.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Proposal to suppress errors thrown by to_reg*()

2019-03-19 Thread Tatsuo Ishii
>> According to the document, "to_reg* functions return null rather than
>> throwing an error if the name is not found", but this is not the case
>> if the arguments to those functions are schema qualified and the
>> caller does not have access permission of the schema even if the table
>> (or other object) does exist -- we get an error.
> 
> You explicitly specified the namespace, so I don't think this is a
> not-found case. It is right that the error happens, since you
> explicitly tried to access an unprivileged schema.
> 
>> For example, to_regclass() throws an error if its argument is
>> 'schema_name.table_name' (i.e. contains a schema name) and the caller's
>> role doesn't have access permission on the schema. The same can be
>> said of the other to_reg*() functions.
>> 
>> I got complaints from Pgpool-II users because it uses to_regclass()
>> internally to confirm a table's existence, but in the case above it is
>> not useful because the error aborts the user's transaction.
> 
> I'm not sure how such inaccessible table names get passed to the
> function there, but it is also natural that any user trying to
> access unprivileged objects gets an error.

You misunderstand the functionality of to_regclass(). Even if a user
does not have access privileges on a certain table, to_regclass() does
not raise an error.

test=> select * from t1;
ERROR:  permission denied for table t1

test=> select to_regclass('t1')::oid;
 to_regclass 
-
 1647238
(1 row)

So why can't we do the same thing for schema? For me, that way seems
to be more consistent.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



RE: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-03-19 Thread Jamison, Kirk
Hi Hari-san,

On Sunday, February 10, 2019 2:25 PM (GMT+9), Haribabu Kommi wrote:
> I tried to fix it by adding a check for whether it is a parallel worker or
> not, and based on that counting them into the stats. Patch attached.
>
> With this patch it currently doesn't count parallel worker transactions, while
> the rest of the stats like seqscan etc. still get counted. IMO they may still
> need to be counted, as those stats represent the number of tuples
> returned, etc.
>
> Comments?

I took a look at your patch, and it’s pretty straightforward.
However, currently the patch does not apply, so I reattached an updated one
to keep the CFbot happy.

The previous patch also had a missing header to detect parallel workers
so I added that. --> #include "access/parallel.h"

Regards,
Kirk Jamison


0001-Avoid-counting-parallel-worker-transactions-stats_v2.patch
Description: 0001-Avoid-counting-parallel-worker-transactions-stats_v2.patch


Re: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-03-19 Thread Haribabu Kommi
On Tue, Mar 19, 2019 at 2:32 PM Rahila Syed 
wrote:

> Hi Haribabu,
>
> The latest patch fails while applying header files part. Kindly rebase.
>

Thanks for the review.


> The patch looks good to me. However, I wonder what the other scenarios are
> where xact_commit is incremented, because even if I commit a single
> transaction with your patch applied, the increment in xact_commit is > 1. As
> you mentioned upthread, we need to check what internal operation resulted in
> the increase. But the increase in xact_commit is surely smaller with the
> patch.
>

Currently, the transaction counts are also incremented by background processes
like autovacuum and the checkpointer.
When you turn autovacuum off, you can see exactly by how much the transaction
commit count increases.
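
For example, one way to check (assuming superuser access; autovacuum can
also be turned off in postgresql.conf):

ALTER SYSTEM SET autovacuum = off;
SELECT pg_reload_conf();
SELECT xact_commit FROM pg_stat_database WHERE datname = current_database();
-- run the workload being measured, then read the counter again
SELECT xact_commit FROM pg_stat_database WHERE datname = current_database();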


> postgres=# BEGIN;
> BEGIN
> postgres=# select xact_commit from pg_stat_database where datname =
> 'postgres';
>  xact_commit
> -
>  158
> (1 row)
>
> postgres=# explain analyze select * from foo where i = 1000;
>QUERY
> PLAN
>
> 
>  Gather  (cost=1000.00..136417.85 rows=1 width=37) (actual
> time=4.596..3792.710 rows=1 loops=1)
>Workers Planned: 2
>Workers Launched: 2
>->  Parallel Seq Scan on foo  (cost=0.00..135417.75 rows=1 width=37)
> (actual time=2448.038..3706.009 rows=0 loops=3)
>  Filter: (i = 1000)
>  Rows Removed by Filter: 333
>  Planning Time: 0.353 ms
>  Execution Time: 3793.572 ms
> (8 rows)
>
> postgres=# commit;
> COMMIT
> postgres=# select xact_commit from pg_stat_database where datname =
> 'postgres';
>  xact_commit
> -
>  161
> (1 row)
>

Thanks for the test and confirmation.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-03-19 Thread Haribabu Kommi
On Tue, Mar 19, 2019 at 6:47 PM Jamison, Kirk 
wrote:

> Hi Hari-san,
>
> On Sunday, February 10, 2019 2:25 PM (GMT+9), Haribabu Kommi wrote:
> > I tried to fix it by adding a check for whether it is a parallel worker or
> > not, and based on that counting them into the stats. Patch attached.
> >
> > With this patch it currently doesn't count parallel worker transactions, while
> > the rest of the stats like seqscan etc. still get counted. IMO they may still
> > need to be counted, as those stats represent the number of tuples
> > returned, etc.
> >
> > Comments?
>
> I took a look at your patch, and it’s pretty straightforward.
> However, currently the patch does not apply, so I reattached an updated one
> to keep the CFbot happy.
>
> The previous patch also had a missing header to detect parallel workers
> so I added that. --> #include "access/parallel.h"

Thanks for the update and review, Kirk.

Regards,
Haribabu Kommi
Fujitsu Australia


RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-19 Thread Tsunakawa, Takayuki
Hi Peter, Imai-san,

From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> Your changes in LOCALLOCK still refer to PGPROC, from your first version
> of the patch.
> 
> I think the reordering of struct members could be done as a separate
> preliminary patch.
> 
> Some more documentation in the comment before dlist_head LocalLocks to
> explain this whole mechanism would be nice.

Fixed.


> You posted a link to some performance numbers, but I didn't see the test
> setup explained there.  I'd like to get some more information on the
> impact of this.  Is there an effect with 100 tables, or do you need 10?

Imai-san, can you tell us the test setup?


Regards
Takayuki Tsunakawa



0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch
Description: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch


0002-speed-up-LOCALLOCK-scan.patch
Description: 0002-speed-up-LOCALLOCK-scan.patch


Re: [HACKERS] Block level parallel vacuum

2019-03-19 Thread Kyotaro HORIGUCHI
At Tue, 19 Mar 2019 13:31:04 +0900, Masahiko Sawada  
wrote in 
> > For indexes=4,8,16, the cases with parallel_degree=4,8,16 behave
> > almost the same. I suspect that the indexes are too small and all
> > the index pages were in memory and the CPU was saturated. Maybe you
> > had four cores and parallel workers beyond that number had no
> > effect.  Other normal backends would have been able to do almost
> > nothing meanwhile. Usually the number of parallel workers is
> > determined so that IO capacity is filled up, but this feature
> > intermittently saturates CPU capacity under such a
> > situation.
> >
> 
> I'm sorry I didn't make it clear enough. If the parallel degree is
> higher than 'the number of indexes - 1' redundant workers are not
> launched. So for indexes=4, 8, 16 the number of actually launched
> parallel workers is up to 3, 7, 15 respectively. That's why the result
> shows almost the same execution time in the cases where nindexes <=
> parallel_degree.

In the 16 indexes case, the performance saturated at 4 workers,
which contradicts your explanation.

> I'll share the performance test result of more larger tables and indexes.
> 
> > I'm not sure, but what if we do index vacuum in one-tuple-by-one
> > manner? That is, heap vacuum passes dead tuple one-by-one (or
> > buffering few tuples) to workers and workers process it not by
> > bulkdelete, but just tuple_delete (we don't have one). That could
> > avoid the sleep time of heap-scan while index bulkdelete.
> >
> 
> Just to be clear, in parallel lazy vacuum all parallel vacuum
> processes including the leader process do index vacuuming; no one
> sleeps during index vacuuming. The leader process does the heap
> scan and launches parallel workers before index vacuuming. Each
> process exclusively processes indexes one by one.

The leader doesn't continue the heap scan while index vacuuming is
running. And the index page scan seems to eat up CPU easily. If
index vacuum can run simultaneously with the next heap scan
phase, we can make the index scan finish at almost the same time as
the next round of heap scan. It would reduce the (possible) CPU
contention. But this requires twice the size of shared
memory as the current implementation.

> Such an index deletion method could be an optimization, but I'm not sure
> that calling tuple_delete many times would be faster than one
> bulkdelete. If there are many dead tuples, vacuum has to call
> tuple_delete as many times as there are dead tuples. In general one seqscan
> is faster than tons of indexscans. There is a proposal for such one-by-one
> index deletions[1] but it's not a replacement for bulkdelete.

I'm not sure what you mean by 'replacement', but it depends on how
large a part of a table is removed at once, as mentioned in the
thread. But unfortunately it doesn't seem easy to do.

> > > Attached the updated version patches. The patches apply to the current
> > > HEAD cleanly but the 0001 patch still changes the vacuum option to a
> > > Node since it's under the discussion. After the direction has been
> > > decided, I'll update the patches.
> >
> > As for the to-be-or-not-to-be a node problem, I don't think it is
> > needed but from the point of consistency, it seems reasonable and
> > it is seen in other nodes that *Stmt Node holds option Node. But
> > makeVacOpt and its usage, and subsequent operations on the node
> > look somewhat strange.. Why don't you just do
> > "makeNode(VacuumOptions)"?
> 
> Thank you for the comment but this part has gone away as the recent
> commit changed the grammar production of vacuum command.

Oops!


> > >+  /* Estimate size for dead tuples -- PARALLEL_VACUUM_KEY_DEAD_TUPLES 
> > >*/
> > >+maxtuples = compute_max_dead_tuples(nblocks, nindexes > 0);
> >
> > If I understand this correctly, nindexes is always > 1 there. At
> > least assert that it is > 0 there.
> >
> > >+  estdt = MAXALIGN(add_size(sizeof(LVDeadTuples),
> >
> > I don't think the name is good. (dt meant 'detach' at first look for
> > me..)
> 
> Fixed.
> 
> >
> > >+if (lps->nworkers_requested > 0)
> > >+appendStringInfo(&buf,
> > >+ ngettext("launched %d parallel vacuum worker 
> > >for index cleanup (planned: %d, requested %d)",
> >
> > "planned"?
> 
> The 'planned' shows how many parallel workers we planned to launch.
> The degree of parallelism is determined based on either user request
> or the number of indexes that the table has.
> 
> >
> >
> > >+/* Get the next index to vacuum */
> > >+if (do_parallel)
> > >+idx = pg_atomic_fetch_add_u32(&(lps->lvshared->nprocessed), 
> > >1);
> > >+else
> > >+idx = nprocessed++;
> >
> > It seems that both of the two cases can be handled using
> > LVParallelState and most of the branches by lps or do_parallel
> > can be removed.
> >
> 
> Sorry I couldn't get your comment. You meant to move nprocessed to
> LVParallelState?

Exactly. I meant letting lvshared point to private memory, but it might
introduce confusion.

Re: DNS SRV support for LDAP authentication

2019-03-19 Thread Thomas Munro
On Sat, Feb 16, 2019 at 10:57 PM Thomas Munro
 wrote:
> Yeah.  This coding is ugly and StringInfo would be much nicer.
> Thinking about that made me realise that the proposed SRV case should
> also handle multiple SRV records by building a multi-URL string too
> (instead of just taking the first one).  I will make it so.

Done, in the attached.  Reviewing your comments again, from the top:

On Sat, Feb 2, 2019 at 12:48 AM Daniel Gustafsson  wrote:
> + If PostgreSQL was compiled with OpenLDAP as
>
> Should OpenLDAP be wrapped in  tags as well?  If so, there is
> another “bare” instance in client-auth.sgml which perhaps can be wrapped into
> this patch while at it.

Fixed.

> +   ereport(LOG,
> +   (errmsg("could not look up a hostlist for %s",
> +   domain)));
>
> Should this be \”%s\”?

Yep, fixed.

> +   new_uris = psprintf("%s%s%s://%s:%d",
>
> While this construction isn't introduced in this patch, would it not make 
> sense
> to convert uris to StringInfo instead to improve readability?

Agreed, fixed.

> +   /* Step over this hostname and any spaces. */
>
> Nitpicking on a moved hunk, but single-line comments shouldn’t end in a period
> I believe.

Huh. And yet they are sentences.

tmunro@dogmatix $ git grep '/\* [A-Za-z].*\. \*/' | wc -l
5607
tmunro@dogmatix $ git grep '/\* [A-Za-z].*[a-z] \*/' | wc -l
   59500

Yep, you win!

I also fixed a bug where some error messages could pass a NULL pointer
for %s when we don't have a server name.

I also added a hint to the error message you get if it can't find DNS
SRV records, so that if you accidentally activate this feature by
forgetting to set the server name, it'll remind you that you could do
that:

 LOG:  LDAP authentication could not find DNS SRV records for "example.net"
 HINT:  Set an LDAP server name explicitly.

Unfortunately, no feedback from MS Active Directory users has been
forthcoming, but I guess that might take a beta release.  See below
for new more complete instructions for testing this with an open
source stack (now that I know there is a lazy way to stand up an LDAP
server using the TAP test stuff, I've adjusted the instructions to
work with that).

I'd like to commit this soon.

Some random things I noticed that I am not fixing in this patch but
wanted to mention:  I don't like the asymmetry initStringInfo(si),
pfree(si->data).  I don't like si->data as a way to get a C string
from a StringInfo.  There are a couple of references to StringBuffer
that surely mean StringInfo in comments.

=== How to test ===

1.  Start up an LDAP server that has a user test1/secret1 under
dc=example,dc=net (it runs in the background and you can stop it with
SIGINT):

$ make -C src/test/ldap check
$ /usr/local/libexec/slapd -f src/test/ldap/tmp_check/slapd.conf -h
ldap://127.0.0.1:

2.  Start up a BIND daemon that has multiple SRV records for LDAP at
example.com:

$ tail -4 /usr/local/etc/namedb/named.conf
zone "example.net" {
type master;
file "/usr/local/etc/namedb/master/example.net";
};
$ cat /usr/local/etc/namedb/master/example.net
$TTL10
@   IN  SOA ns.example.net. admin.example.net. (
  2 ; Serial
 604800 ; Refresh
  86400 ; Retry
2419200 ; Expire
 604800 )   ; Negative Cache TTL
IN  NS  ns.example.net.
ns.example.net. IN A 127.0.0.1
example.net. IN A 127.0.0.1
ldap1.example.net. IN A 127.0.0.1
ldap2.example.net. IN A 127.0.0.1
_ldap._tcp.example.net. IN SRV 0 0  ldap1
_ldap._tcp.example.net. IN SRV 1 0  ldap2

3.  Tell your OS to talk to that DNS server (and, erm, keep what you
had here so you can restore it later):

$ cat /etc/resolv.conf
nameserver 127.0.0.1

4.  Check that standard DNS and LDAP tools can find their way to your
LDAP servers via these breadcrumbs:

$ host -t srv _ldap._tcp.example.net
_ldap._tcp.example.net has SRV record 0 0  ldap1.example.net.
_ldap._tcp.example.net has SRV record 1 0  ldap2.example.net.

$ ldapsearch -H 'ldap:///dc%3Dexample%2Cdc%3Dnet' -b 'dc=example,dc=net'

5.  Tell PostgreSQL to use SRV records in pg_hba.conf using either of
these styles:

host all test1 127.0.0.1/32 ldap basedn="dc=example,dc=net"
host all test1 127.0.0.1/32 ldap ldapurl="ldap:///dc=example,dc=net?uid?sub";

6.  Check that you now log in as test1/secret1:

$ psql -h 127.0.0.1 postgres test1

-- 
Thomas Munro
https://enterprisedb.com


0001-Add-DNS-SRV-support-for-LDAP-server-discovery-v3.patch
Description: Binary data


Willing to fix a PQexec() in libpq module

2019-03-19 Thread Wu, Fei
Hi, all

On website: https://wiki.postgresql.org/wiki/Todo#libpq
I found that in the libpq module, there is a TODO case:
---
Consider disallowing multiple queries in PQexec() as an additional barrier to 
SQL injection attacks
---
I am interested in this one. So, has it been fixed?
If not, I am willing to do so.
In the manual, I found that:
-
Unlike PQexec, PQexecParams allows at most one SQL command in the given string.
(There can be semicolons in it, but not more than one nonempty command.) This is
a limitation of the underlying protocol, but has some usefulness as an extra
defense against SQL-injection attacks.

---
Maybe we can fix PQexec() just like PQexecParams()?

I will try to fix it~


--
Best Regards
-
Wu Fei
DX3
Software Division III
Nanjing Fujitsu Nanda Software Tech. Co., Ltd.(FNST)
ADDR.: No.6 Wenzhu Road, Software Avenue,
   Nanjing, 210012, China
TEL  : +86+25-86630566-9356
COINS: 7998-9356
FAX: +86+25-83317685
MAIL:wufei.f...@cn.fujitsu.com
http://www.fujitsu.com/cn/fnst/
---





Re: Proposal to suppress errors thrown by to_reg*()

2019-03-19 Thread Kyotaro HORIGUCHI
At Tue, 19 Mar 2019 16:35:32 +0900 (JST), Tatsuo Ishii  
wrote in <20190319.163532.529526338176696856.t-is...@sraoss.co.jp>
> >> According to the document, "to_reg* functions return null rather than
> >> throwing an error if the name is not found", but this is not the case
> >> if the arguments to those functions are schema qualified and the
> >> caller does not have access permission of the schema even if the table
> >> (or other object) does exist -- we get an error.
> > 
> > You explicitly specified the namespace, so I don't think this is a
> > not-found case. It is right that the error happens, since you
> > explicitly tried to access an unprivileged schema.
> > 
> >> For example, to_regclass() throws an error if its argument is
> >> 'schema_name.table_name' (i.e. contains a schema name) and the caller's
> >> role doesn't have access permission on the schema. The same can be
> >> said of the other to_reg*() functions.
> >> 
> >> I got complaints from Pgpool-II users because it uses to_regclass()
> >> internally to confirm a table's existence, but in the case above it is
> >> not useful because the error aborts the user's transaction.
> > 
> > I'm not sure how such inaccessible table names get passed to the
> > function there, but it is also natural that any user trying to
> > access unprivileged objects gets an error.
> 
> You misunderstand the functionality of to_regclass(). Even if a user
> does not have access privileges on a certain table, to_regclass() does
> not raise an error.
>
> test=> select * from t1;
> ERROR:  permission denied for table t1
> 
> test=> select to_regclass('t1')::oid;
>  to_regclass 
> -
>  1647238
> (1 row)
> 
> So why can't we do the same thing for schema? For me, that way seems
> to be more consistent.

It seems to be a different thing. The oid 1647238 would be a
table in the public schema or any schema that the user has access
to. If search_path contained only unprivileged schemas, the
function silently ignores such schemas.

=> set search_path to s1;   -- the user doesn't have access to this schema.
=> select to_regclass('t1')::oid; -- the table really exists.
> to_regclass 
> -
>  
> (1 row)

Superuser gets the expected result.

=#  set search_path to s1;
=# select to_regclass('t1')::oid; -- superuser has access to s1.
>  to_regclass 
> -
>87612
> (1 row)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Proposal to suppress errors thrown by to_reg*()

2019-03-19 Thread Tatsuo Ishii
>> You misunderstand the functionality of to_regclass(). Even if a user
>> does not have access privileges on a certain table, to_regclass() does
>> not raise an error.
>>
>> test=> select * from t1;
>> ERROR:  permission denied for table t1
>> 
>> test=> select to_regclass('t1')::oid;
>>  to_regclass 
>> -
>>  1647238
>> (1 row)
>> 
>> So why can't we do the same thing for schema? For me, that way seems
>> to be more consistent.
> 
> It seems to be a different thing. The oid 1647238 would be a
> table in the public schema or any schema that the user has access
> to. If search_path contained only unprivileged schemas, the
> function silently ignores such schemas.
> 
> => set search_path to s1;   -- the user doesn't have access to this schema.
> => select to_regclass('t1')::oid; -- the table really exists.
>> to_regclass 
>> -
>>  
>> (1 row)

I (and Hoshiai-san) am concerned about the following case:

# revoke usage on schema s1 from foo;
REVOKE
:
[connect as foo]
test=> select to_regclass('s1.t1')::oid;
ERROR:  permission denied for schema s1
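
Under the proposal, the same call would simply return NULL instead of
aborting the user's transaction, i.e. something like:

test=> select to_regclass('s1.t1')::oid;
 to_regclass 
-------------
 
(1 row)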

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] Block level parallel vacuum

2019-03-19 Thread Masahiko Sawada
On Tue, Mar 19, 2019 at 10:39 AM Haribabu Kommi
 wrote:
>
>
> On Mon, Mar 18, 2019 at 1:58 PM Masahiko Sawada  wrote:
>>
>> On Tue, Feb 26, 2019 at 7:20 PM Masahiko Sawada  
>> wrote:
>> >
>> > On Tue, Feb 26, 2019 at 1:35 PM Haribabu Kommi  
>> > wrote:
>> > >
>> > > On Thu, Feb 14, 2019 at 9:17 PM Masahiko Sawada  
>> > > wrote:
>> > >>
>> > >> Thank you. Attached the rebased patch.
>> > >
>> > >
>> > > I ran some performance tests to compare the parallelism benefits,
>> >
>> > Thank you for testing!
>> >
>> > > but I got some strange results of performance overhead, may be it is
>> > > because, I tested it on my laptop.
>> >
>> > Hmm, I think the parallel vacuum would help for heavy workloads like a
>> > big table with multiple indexes. In your test result, all executions
>> > are completed within 1 sec, which seems to be one use case that the
>> > parallel vacuum wouldn't help. I suspect that the table is small,
>> > right? Anyway I'll also do performance tests.
>> >
>>
>> Here are the performance test results. I've set up a 500MB table with
>> several indexes and made 10% of the table dirty before each vacuum.
>> I compared the execution time of the patched postgres with the current HEAD
>> (the 'speed_up' column). In my environment,
>>
>>  indexes | parallel_degree |  patched   |head| speed_up
>> -+-+++--
>>   0 |   0 |   238.2085 |   244.7625 |   1.0275
>>   0 |   1 |   237.7050 |   244.7625 |   1.0297
>>   0 |   2 |   238.0390 |   244.7625 |   1.0282
>>   0 |   4 |   238.1045 |   244.7625 |   1.0280
>>   0 |   8 |   237.8995 |   244.7625 |   1.0288
>>   0 |  16 |   237.7775 |   244.7625 |   1.0294
>>   1 |   0 |  1328.8590 |  1334.9125 |   1.0046
>>   1 |   1 |  1325.9140 |  1334.9125 |   1.0068
>>   1 |   2 |  1333.3665 |  1334.9125 |   1.0012
>>   1 |   4 |  1329.5205 |  1334.9125 |   1.0041
>>   1 |   8 |  1334.2255 |  1334.9125 |   1.0005
>>   1 |  16 |  1335.1510 |  1334.9125 |   0.9998
>>   2 |   0 |  2426.2905 |  2427.5165 |   1.0005
>>   2 |   1 |  1416.0595 |  2427.5165 |   1.7143
>>   2 |   2 |  1411.6270 |  2427.5165 |   1.7197
>>   2 |   4 |  1411.6490 |  2427.5165 |   1.7196
>>   2 |   8 |  1410.1750 |  2427.5165 |   1.7214
>>   2 |  16 |  1413.4985 |  2427.5165 |   1.7174
>>   4 |   0 |  4622.5060 |  4619.0340 |   0.9992
>>   4 |   1 |  2536.8435 |  4619.0340 |   1.8208
>>   4 |   2 |  2548.3615 |  4619.0340 |   1.8126
>>   4 |   4 |  1467.9655 |  4619.0340 |   3.1466
>>   4 |   8 |  1486.3155 |  4619.0340 |   3.1077
>>   4 |  16 |  1481.7150 |  4619.0340 |   3.1174
>>   8 |   0 |  9039.3810 |  8990.4735 |   0.9946
>>   8 |   1 |  4807.5880 |  8990.4735 |   1.8701
>>   8 |   2 |  3786.7620 |  8990.4735 |   2.3742
>>   8 |   4 |  2924.2205 |  8990.4735 |   3.0745
>>   8 |   8 |  2684.2545 |  8990.4735 |   3.3493
>>   8 |  16 |  2672.9800 |  8990.4735 |   3.3635
>>  16 |   0 | 17821.4715 | 17740.1300 |   0.9954
>>  16 |   1 |  9318.3810 | 17740.1300 |   1.9038
>>  16 |   2 |  7260.6315 | 17740.1300 |   2.4433
>>  16 |   4 |  5538.5225 | 17740.1300 |   3.2030
>>  16 |   8 |  5368.5255 | 17740.1300 |   3.3045
>>  16 |  16 |  5291.8510 | 17740.1300 |   3.3523
>> (36 rows)
>
>
> The performance results are good. Do we want to add the recommended
> size in the document for the parallel option? The parallel option for smaller
> tables can lead to performance overhead.
>

Hmm, I don't think we can add a specific recommended size because
the performance gain from parallel lazy vacuum depends on various things
such as CPU cores, the number of indexes, shared buffer size, index
types, and HDD or SSD. I suppose that users who want to use this option
have some sort of performance problem, such as vacuum taking a very
long time. They would use it for relatively larger tables.
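
For reference, an invocation would look something like this (hypothetical
spelling; the exact option syntax is still being discussed in this thread):

VACUUM (PARALLEL 4) large_table;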


Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-19 Thread Imai, Yoshikazu
Hi Tsunakawa-san, Peter

On Tue, Mar 19, 2019 at 7:53 AM, Tsunakawa, Takayuki wrote:
> From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> > You posted a link to some performance numbers, but I didn't see the
> > test setup explained there.  I'd like to get some more information on
> > the impact of this.  Is there an effect with 100 tables, or do you
> > need 10?
> 
> Imai-san, can you tell us the test setup?

Maybe I used this test setup[1].

I tested again with those settings for prepared transactions.
I used Tsunakawa-san's patch for locallock[2] (which couldn't be applied to 
current master so I fixed it) and Amit's v32 patch for speeding up planner[3]. 

[settings]
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

[partitioning table definitions(with 4096 partitions)]
create table rt (a int, b int, c int) partition by range (a);

\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x;
\gexec
\o

[select4096.sql]
\set a random(1, 4096)
select a from rt where a = :a;

[pgbench(with 4096 partitions)]
pgbench -n -f select4096.sql -T 60 -M prepared

[results]  (TPS)
          master  locallock    v32  v32+locallock
          ------  ---------  -----  -------------
auto        21.9       22.9  6,834          7,355
custom      19.7       20.0  7,415          7,252


[1] 
https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A51256276%40g01jpexmbkw24
[2] 
https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1FBDFA00%40G01JPEXMBYT05
[3] 
https://www.postgresql.org/message-id/9feacaf6-ddb3-96dd-5b98-df5e927b1439%40lab.ntt.co.jp

--
Yoshikazu Imai



RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-19 Thread Tsunakawa, Takayuki
From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
> Fixed.

Rebased on HEAD.


Regards
Takayuki Tsunakawa



0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch
Description: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch


0002-speed-up-LOCALLOCK-scan.patch
Description: 0002-speed-up-LOCALLOCK-scan.patch


Re: jsonpath

2019-03-19 Thread John Naylor
On Sun, Feb 24, 2019 at 5:03 PM Alexander Korotkov
 wrote:
> On Wed, Jan 30, 2019 at 5:28 AM Andres Freund  wrote:
> > Why -CF, and why is -p repeated?
>
> BTW, for our SQL grammar we have
>
> > scan.c: FLEXFLAGS = -CF -p -p
>
> Is it kind of default?

I just saw this in the committed patch. This is not the default; it's
chosen for maximum performance at the expense of binary/memory size.
That's fine, but with a little effort you can also make the scanner
non-backtracking for additional performance, as in the attached.

-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


jsonpath-scan-nobackup.patch
Description: Binary data


Re: partitioned tables referenced by FKs

2019-03-19 Thread Amit Langote
Hi,

Thanks for updating the patch.  I'll reply to other parts separately.

On 2019/03/19 7:02, Alvaro Herrera wrote:
> A pretty silly bug remains here.  Watch:
> 
> create table pk (a int primary key) partition by list (a);
> create table pk1 partition of pk for values in (1);
> create table fk (a int references pk);
> insert into pk values (1);
> insert into fk values (1);
> drop table pk cascade;
> 
> Note that the DROP of the main PK table is pretty aggressive: since it
> cascades, you want it to drop the constraint on the FK side.  This would
> work without a problem if 'pk' was a non-partitioned table, but in this
> case it fails:
> 
> alvherre=# drop table pk cascade;
> NOTICE:  drop cascades to constraint fk_a_fkey on table fk
> ERROR:  removing partition "pk1" violates foreign key constraint "fk_a_fkey1"
> DETAIL:  Key (a)=(1) still referenced from table "fk".
> 
> The reason is that the "pre drop check" that's done before allowing a drop
> of the partition doesn't realize that the constraint is also being
> dropped (and so the check ought to be skipped).  If you were to do just
> "DROP TABLE pk1" then this complaint would be correct, since it would
> leave the constraint in an invalid state.  But here, it's bogus and
> annoying.  You can DELETE the matching values from table FK and then the
> DROP works.  Here's another problem caused by the same misbehavior:
> 
> alvherre=# drop table pk, fk;
> ERROR:  removing partition "pk1" violates foreign key constraint "fk_a_fkey1"
> DETAIL:  Key (a)=(1) still referenced from table "fk".
> 
> Note here we want to get rid of table 'fk' completely.  If you split it
> up in a DROP of fk, followed by a DROP of pk, it works.
> 
> I'm not yet sure what's the best way to attack this.  Maybe the
> "pre-drop check" needs a completely different approach.  The simplest
> approach is to prohibit a table drop or detach for any partitioned table
> that's referenced by a foreign key, but that seems obnoxious and
> inconvenient.

Agreed.

Will it suffice or be OK if we skipped invoking the pre-drop callback for
objects that are being "indirectly" dropped?  I came up with the attached
patch to resolve this problem, if that idea has any merit.  We also get a
slightly better error message, as seen in the expected/foreign_key.out changes.

Thanks,
Amit
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 11ec9d2f85..223ff9e4d8 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -237,17 +237,15 @@ deleteObjectsInList(ObjectAddresses *targetObjects, Relation *depRel,
 	{
 		const ObjectAddress *thisobj = &targetObjects->refs[i];
 		Oid			objectClass = getObjectClass(thisobj);
+		const ObjectAddressExtra *extra = &targetObjects->extras[i];
+		bool		original = extra->flags & DEPFLAG_ORIGINAL;
 
 		if (trackDroppedObjectsNeeded() && !(flags & PERFORM_DELETION_INTERNAL))
 		{
 			if (EventTriggerSupportsObjectClass(objectClass))
 			{
-				bool		original = false;
 				bool		normal = false;
-				const ObjectAddressExtra *extra = &targetObjects->extras[i];
 
-				if (extra->flags & DEPFLAG_ORIGINAL)
-					original = true;
 				if (extra->flags & DEPFLAG_NORMAL ||
 					extra->flags & DEPFLAG_REVERSE)
 					normal = true;
@@ -256,6 +254,13 @@ deleteObjectsInList(ObjectAddresses *targetObjects, Relation *depRel,
 			}
 		}
 
+		/*
+		 * Do not invoke the callback for objects that are being dropped
+		 * indirectly.
+		 */
+		if (!original)
+			continue;
+
 		/* Invoke class-specific pre-deletion checks */
 		switch (objectClass)
 		{
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 6680f755dd..9305ba8017 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1837,14 +1837,11 @@ pre_drop_class_check(Oid relationId, Oid objectSubId)
 	relation = relation_open(relationId, NoLock);
 
 	/*
-	 * For leaf partitions, this is our last chance to verify any foreign keys
+	 * For partitions, this is our last chance to verify any foreign keys
 	 * that may point to the partition as referenced table.
 	 */
-	if (relation->rd_rel->relkind == RELKIND_RELATION &&
-		relation->rd_rel->relispartition)
-		CheckNoForeignKeyRefs(relation,
-							  GetParentedForeignKeyRefs(relation),
-							  

Re: [HACKERS] Custom compression methods

2019-03-19 Thread Chris Travers
On Mon, Mar 18, 2019 at 11:09 PM Tomas Vondra 
wrote:

>
>
> On 3/15/19 12:52 PM, Ildus Kurbangaliev wrote:
> > On Fri, 15 Mar 2019 14:07:14 +0400
> > David Steele  wrote:
> >
> >> On 3/7/19 11:50 AM, Alexander Korotkov wrote:
> >>> On Thu, Mar 7, 2019 at 10:43 AM David Steele  >>> > wrote:
> >>>
> >>> On 2/28/19 5:44 PM, Ildus Kurbangaliev wrote:
> >>>
> >>>  > there are another set of patches.
> >>>  > Only rebased to current master.
> >>>  >
> >>>  > Also I will change status on commitfest to 'Needs review'.
> >>>
> >>> This patch has seen periodic rebases but no code review that I
> >>> can see since last January 2018.
> >>>
> >>> As Andres noted in [1], I think that we need to decide if this
> >>> is a feature that we want rather than just continuing to push it
> >>> from CF to CF.
> >>>
> >>>
> >>> Yes.  I took a look at code of this patch.  I think it's in pretty
> >>> good shape.  But high level review/discussion is required.
> >>
> >> OK, but I think this patch can only be pushed one more time, maximum,
> >> before it should be rejected.
> >>
> >> Regards,
> >
> > Hi,
> > in my opinion this patch is usually skipped not because it is not
> > needed, but because of its size. It is not hard to maintain it until
> > committers have time for it or I get an actual response that
> > nobody is going to commit it.
> >
>
> That may be one of the reasons, yes. But there are other reasons, which
> I think may be playing a bigger role.
>
> There's one practical issue with how the patch is structured - the docs
> and tests are in separate patches towards the end of the patch series,
> which makes it impossible to commit the preceding parts. This needs to
> change. Otherwise the patch size kills the patch as a whole.
>
> But there's a more important cost/benefit issue, I think. When I look at
> patches as a committer, I naturally have to weight how much time I spend
> on getting it in (and then dealing with fallout from bugs etc) vs. what
> I get in return (measured in benefits for community, users). This patch
> is pretty large and complex, so the "costs" are quite high, while the
> benefit from the patch itself is the ability to pick between pg_lz and
> zlib. Which is not great, and so people tend to pick other patches.
>
> Now, I understand there's a lot of potential benefits further down the
> line, like column-level compression (which I think is the main goal
> here). But that's not included in the patch, so the gains are somewhat
> far in the future.
>

Not discussing whether any particular committer should pick this up but I
want to discuss an important use case we have at Adjust for this sort of
patch.

The PostgreSQL compression strategy is something we find inadequate for at
least one of our large deployments (a large debug log spanning 10PB+).  Our
current solution is to set storage so that it does not compress and then
run on ZFS to get compression speedups on spinning disks.

But running PostgreSQL on ZFS has some annoying costs because we have
copy-on-write on copy-on-write, and when you add file fragmentation... I
would really like to be able to get away from having to do ZFS as an
underlying filesystem.  While we have good write throughput, read
throughput is not as good as I would like.

An approach that would give us better row-level compression would allow us
to ditch the COW-filesystem-under-PostgreSQL approach.

So I think the benefits are actually quite high particularly for those
dealing with volume/variety problems where things like JSONB might be a
go-to solution.  Similarly I could totally see having systems which handle
large amounts of specialized text having extensions for dealing with these.

>
> But hey, I think there are committers working for postgrespro, who might
> have the motivation to get this over the line. Of course, assuming that
> there are no serious objections to having this functionality or how it's
> implemented ... But I don't think that was the case.
>

While I am not currently able to speak for questions of how it is
implemented, I can say with very little doubt that we would almost
certainly use this functionality if it were there and I could see plenty of
other cases where this would be a very appropriate direction for some other
projects as well.


> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>

-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: [HACKERS] CLUSTER command progress monitor

2019-03-19 Thread Kyotaro HORIGUCHI
At Tue, 19 Mar 2019 11:02:57 +0900, Tatsuro Yamada 
 wrote in 

> On 2019/03/19 10:43, Tatsuro Yamada wrote:
> > Hi Rafia!
> > On 2019/03/18 20:42, Rafia Sabih wrote:
> >> On Fri, 8 Mar 2019 at 09:14, Tatsuro Yamada
> >>  wrote:
> >>> Attached file is rebased patch on current HEAD.
> >>> I changed a status. :)
> >>>
> >>>
> >> Looks like the patch needs a rebase.
> >> I was on the commit fb5806533f9fe0433290d84c9b019399cd69e9c2
> >>
> >> PFA reject file in case you want to have a look.
> > Thanks for testing it. :)
> > I rebased the patch on the current head:
> > f2004f19ed9c9228d3ea2b12379ccb4b9212641f.
> > Please find attached file.
> > Also, I share my test case of progress monitor below.
> 
> 
> Attached patch is a rebased document patch. :)

The monitor view has four columns:

  heap_tuples_scanned
: used during the "seq scan" and "index scan" phases.

  heap_blks_total, heap_blks_scanned
: used only during the "seq scan" phase.

  index_rebuild_count
: used only during the "rebuilding index" phase.


Couldn't we change the view like the following?

.. |   phase| heap tuples scanned | progress indicator  | value
..-++-+-+
.. | seq scan ..|  2134214| heap blocks pct | 23.5%
.. | index sc ..|  5453395| (null)  |  (null)
.. | sorting  ..| | (null)  |  (null)
.. | swapping ..| | (null)  |  (null)
.. | rebuildi ..| | index count |  2
.. | performi ..| | (null)  |  (null)

Only seq scan phase has two kind of progress indicator so if we
choose "heap blks pct" as the only indicator for the phase, it
could be simplified as:

.. |   phase | progress indicator  | value
..-+-+-+
.. | seq scan .. | heap blocks pct | 23.5%
.. | index sc .. | heap tuples scanned | 2398457
.. | sorting  .. | (null)  | (null)
.. | swapping .. | (null)  | (null)
.. | rebuildi .. | index count | 2
.. | performi .. | (null)  | (null)

A downside of the view is that it looks quite differently from
pg_stat_progress_vacuum. Since I'm not sure it's a good design,
feel free to oppose/reject this.

finish_heap_swap is also called in the matview code path but the
function doesn't seem to detect that situation. Is that the right
behavior? (I didn't confirm what happens when it is called from the
matview refresh path, though.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Block level parallel vacuum

2019-03-19 Thread Masahiko Sawada
On Tue, Mar 19, 2019 at 4:59 PM Kyotaro HORIGUCHI
 wrote:
>
> At Tue, 19 Mar 2019 13:31:04 +0900, Masahiko Sawada  
> wrote in 
> > > For indexes=4,8,16, the cases with parallel_degree=4,8,16 behave
> > > almost the same. I suspect that the indexes are too small and all
> > > the index pages were in memory and the CPU was saturated. Maybe you
> > > had four cores and parallel workers beyond that number had no
> > > effect.  Other normal backends would have been able to do almost
> > > nothing meanwhile. Usually the number of parallel workers is
> > > determined so that IO capacity is filled up, but this feature
> > > intermittently saturates CPU capacity under such a
> > > situation.
> > >
> >
> > I'm sorry I didn't make it clear enough. If the parallel degree is
> > higher than 'the number of indexes - 1' redundant workers are not
> > launched. So for indexes=4, 8, 16 the number of actually launched
> > parallel workers is up to 3, 7, 15 respectively. That's why the result
> > shows almost the same execution time in the cases where nindexes <=
> > parallel_degree.
>
> In the 16 indexes case, the performance saturated at 4 workers,
> which contradicts your explanation.

Because the machine I used has 4 cores, the performance doesn't
improve even if more than 4 parallel workers are launched.

>
> > I'll share the performance test result of more larger tables and indexes.
> >
> > > I'm not sure, but what if we do index vacuum in one-tuple-by-one
> > > manner? That is, heap vacuum passes dead tuple one-by-one (or
> > > buffering few tuples) to workers and workers process it not by
> > > bulkdelete, but just tuple_delete (we don't have one). That could
> > > avoid the sleep time of heap-scan while index bulkdelete.
> > >
> >
> > Just to be clear, in parallel lazy vacuum all parallel vacuum
> > processes including the leader process do index vacuuming; no one
> > sleeps during index vacuuming. The leader process does the heap
> > scan and launches parallel workers before index vacuuming. Each
> > process exclusively processes indexes one by one.
>
> The leader doesn't continue the heap scan while index vacuuming is
> running. And the index page scan seems to eat up CPU easily. If
> index vacuum can run simultaneously with the next heap scan
> phase, we can make the index scan finish at almost the same time as
> the next round of heap scan. It would reduce the (possible) CPU
> contention. But this requires twice the size of shared
> memory as the current implementation.

Yeah, I've considered something like a pipelining approach where
one process continues to queue the dead tuples and another process
fetches and processes them during index vacuuming, but the current
version of the patch employs the simplest approach as the first step.
Once we have the retail index deletion approach we might be able to use
it for parallel vacuum.

>
> > Such an index deletion method could be an optimization, but I'm not sure
> > that calling tuple_delete many times would be faster than one
> > bulkdelete. If there are many dead tuples, vacuum has to call
> > tuple_delete as many times as there are dead tuples. In general one seqscan
> > is faster than tons of indexscans. There is a proposal for such one-by-one
> > index deletions[1] but it's not a replacement for bulkdelete.
>
> I'm not sure what you mean by 'replacement', but it depends on how
> large a part of a table is removed at once, as mentioned in the
> thread. But unfortunately it doesn't seem easy to do.
>
> > > > Attached the updated version patches. The patches apply to the current
> > > > HEAD cleanly but the 0001 patch still changes the vacuum option to a
> > > > Node since it's under the discussion. After the direction has been
> > > > decided, I'll update the patches.
> > >
> > > As for the to-be-or-not-to-be a node problem, I don't think it is
> > > needed but from the point of consistency, it seems reasonable and
> > > it is seen in other nodes that *Stmt Node holds option Node. But
> > > makeVacOpt and its usage, and subsequent operations on the node
> > > look somewhat strange.. Why don't you just do
> > > "makeNode(VacuumOptions)"?
> >
> > Thank you for the comment but this part has gone away as the recent
> > commit changed the grammar production of vacuum command.
>
> Oops!
>
>
> > > >+  /* Estimate size for dead tuples -- 
> > > >PARALLEL_VACUUM_KEY_DEAD_TUPLES */
> > > >+maxtuples = compute_max_dead_tuples(nblocks, nindexes > 0);
> > >
> > > If I understand this correctly, nindexes is always > 1 there. At
> > > least assert that it is > 0 there.
> > >
> > > >+  estdt = MAXALIGN(add_size(sizeof(LVDeadTuples),
> > >
> > > I don't think the name is good. (dt meant 'detach' at first look for
> > > me..)
> >
> > Fixed.
> >
> > >
> > > >+if (lps->nworkers_requested > 0)
> > > >+appendStringInfo(&buf,
> > > >+ ngettext("launched %d parallel vacuum 
> > > >worker for index cleanup (planned: %d, requested %d)",
> > >
> > > "planne

Re: Proposal to suppress errors thrown by to_reg*()

2019-03-19 Thread Kyotaro HORIGUCHI
At Tue, 19 Mar 2019 17:54:01 +0900 (JST), Tatsuo Ishii  
wrote in <20190319.175401.646838939186238443.t-is...@sraoss.co.jp>
> > It seems to be a different thing. The oid 1647238 would be a
> > table in the public schema or any schema that the user has access
> > to. If search_path contained only unprivileged schemas, the
> > function silently ignores such schemas.
> > 
> > => set search_path to s1;   -- the user doesn't have access to this schema.
> > => select to_regclass('t1')::oid; -- the table really exists.
> >> to_regclass 
> >> -
> >>  
> >> (1 row)
> 
> I (and Hoshiai-san) am concerned about the following case:
> 
> # revoke usage on schema s1 from foo;
> REVOKE
> :
> [connect as foo]
> test=> select to_regclass('s1.t1')::oid;
> ERROR:  permission denied for schema s1

That works inside a transaction. It looks right that a schema whose
access has actually been revoked cannot be accessed.

S1:foo: begin;
S2:su : revoke usage on schema s1 from foo;
S1:foo: select to_regclass('s1.t1')::oid;
>  to_regclass 
> -
>16418
S1:foo: commit;
S1:foo: select to_regclass('s1.t1')::oid;
> ERROR:  permission denied for schema s1

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Block level parallel vacuum

2019-03-19 Thread Kyotaro HORIGUCHI
At Tue, 19 Mar 2019 19:01:06 +0900, Masahiko Sawada  
wrote in 
> On Tue, Mar 19, 2019 at 4:59 PM Kyotaro HORIGUCHI
>  wrote:
> >
> > At Tue, 19 Mar 2019 13:31:04 +0900, Masahiko Sawada  
> > wrote in 
> > 
> > > > For indexes=4,8,16, the cases with parallel_degree=4,8,16 behave
> > > > almost the same. I suspect that the indexes are too small and all
> > > > the index pages were in memory and the CPU was saturated. Maybe you
> > > > had four cores and parallel workers beyond that number had no
> > > > effect.  Other normal backends would have been able to do almost
> > > > nothing meanwhile. Usually the number of parallel workers is
> > > > determined so that IO capacity is filled up, but this feature
> > > > intermittently saturates CPU capacity under such a
> > > > situation.
> > > >
> > >
> > > I'm sorry I didn't make it clear enough. If the parallel degree is
> > > higher than 'the number of indexes - 1' redundant workers are not
> > > launched. So for indexes=4, 8, 16 the number of actually launched
> > > parallel workers is up to 3, 7, 15 respectively. That's why the result
> > > shows almost the same execution time in the cases where nindexes <=
> > > parallel_degree.
> >
> > In the 16 indexes case, the performance saturated at 4 workers,
> > which contradicts your explanation.
> 
> Because the machine I used has 4 cores, the performance doesn't
> improve even if more than 4 parallel workers are launched.

That is what I mentioned in the cited text. Sorry if the phrasing
was hard to read.

> >
> > > I'll share the performance test result of more larger tables and indexes.
> > >
> > > > I'm not sure, but what if we do index vacuum in one-tuple-by-one
> > > > manner? That is, heap vacuum passes dead tuple one-by-one (or
> > > > buffering few tuples) to workers and workers process it not by
> > > > bulkdelete, but just tuple_delete (we don't have one). That could
> > > > avoid the sleep time of heap-scan while index bulkdelete.
> > > >
> > >
> > > Just to be clear, in parallel lazy vacuum all parallel vacuum
> > > processes including the leader process do index vacuuming; no one
> > > sleeps during index vacuuming. The leader process does the heap
> > > scan and launches parallel workers before index vacuuming. Each
> > > process exclusively processes indexes one by one.
> >
> > The leader doesn't continue the heap scan while index vacuuming is
> > running. And the index page scan seems to eat up CPU easily. If
> > index vacuum can run simultaneously with the next heap scan
> > phase, we can make the index scan finish at almost the same time as
> > the next round of heap scan. It would reduce the (possible) CPU
> > contention. But this requires twice the size of shared
> > memory as the current implementation.
> 
> Yeah, I've considered something like a pipelining approach where
> one process continues to queue the dead tuples and another process
> fetches and processes them during index vacuuming, but the current
> version of the patch employs the simplest approach as the first step.
> Once we have the retail index deletion approach we might be able to use
> it for parallel vacuum.

Ok, I understood the direction.

...
> > > Sorry I couldn't get your comment. You meant to move nprocessed to
> > > LVParallelState?
> >
> > Exactly. I meant letting lvshared points to private memory, but
> > it might introduce confusion.
> 
> Hmm, I'm not sure it would be a good idea. It would introduce
> confusion as you mentioned. And since 'nprocessed' have to be
> pg_atomic_uint32 in parallel mode we will end up having another
> branch.

Ok. Agreed. Thank you for the patience.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] Block level parallel vacuum

2019-03-19 Thread Kyotaro HORIGUCHI
At Tue, 19 Mar 2019 17:51:32 +0900, Masahiko Sawada  
wrote in 
> On Tue, Mar 19, 2019 at 10:39 AM Haribabu Kommi
>  wrote:
> > The performance results are good. Do we want to add the recommended
> > size in the document for the parallel option? the parallel option for 
> > smaller
> > tables can lead to performance overhead.
> >
> 
> Hmm, I don't think we can add the specific recommended size because
> the performance gain by parallel lazy vacuum depends on various things
> such as CPU cores, the number of indexes, shared buffer size, index
> types, HDD or SSD. I suppose that users who want to use this option
> have some sort of performance problem such as that vacuum takes a very
> long time. They would use it for relatively larger tables.

Agreed that we have no recommended setting, but I strongly think that
documentation on the downsides or possible side effects of this feature is
required for those who are going to use it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: ALTER TABLE on system catalogs

2019-03-19 Thread Peter Eisentraut
On 2019-02-08 04:04, Kyotaro HORIGUCHI wrote:
> Hi, I got redirected here by a kind suggestion by Tom.

I have committed my patch, which also addresses the issue you had in
your other thread.

The rest of these discussions have merit but are not really dependent on
my patch.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: jsonpath

2019-03-19 Thread Alexander Korotkov
On Mon, Mar 18, 2019 at 11:23 PM Tom Lane  wrote:
> Alexander Korotkov  writes:
> > On Mon, Mar 18, 2019 at 10:08 PM Tom Lane  wrote:
> >> Just another minor bitch about this patch: jsonpath_scan.l has introduced
> >> a typedef called "keyword".  This is causing pgindent to produce seriously
> >> ugly results in libpq, and probably in other places where that is used as
> >> a field or variable name.  Please rename that typedef to something less
> >> generic.
>
> > Ooops... I propose to rename it to KeyWord, which is already
> > typedef'ed in formatting.c.  See the attached patch.  Is it OK?
>
> I had in mind JsonPathKeyword or something like that.  If you re-use
> formatting.c's typedef name, pgindent won't care, but it's possible
> you'd be in for unhappiness when trying to look at these structs in
> gdb for instance.

Good point, thanks!  Pushed.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: jsonpath

2019-03-19 Thread Alexander Korotkov
On Tue, Mar 19, 2019 at 12:23 PM John Naylor
 wrote:
> On Sun, Feb 24, 2019 at 5:03 PM Alexander Korotkov
>  wrote:
> > On Wed, Jan 30, 2019 at 5:28 AM Andres Freund  wrote:
> > > Why -CF, and why is -p repeated?
> >
> > BTW, for our SQL grammar we have
> >
> > > scan.c: FLEXFLAGS = -CF -p -p
> >
> > Is it kind of default?
>
> I just saw this in the committed patch. This is not default, it's
> chosen for maximum performance at the expense of binary/memory size.
> That's fine, but with a little effort you can also make the scanner
> non-backtracking for additional performance, as in the attached.

We're working on a patch set improving the flex scanner.  It already has a
similar change among others.  But thanks a lot for your attention!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread Kyotaro HORIGUCHI
Hello.

At Tue, 19 Mar 2019 08:18:23 +, "Wu, Fei"  wrote 
in <52E6E0843B9D774C8C73D6CF64402F05621F0FFC@G08CNEXMBPEKD02.g08.fujitsu.local>
> Hi,all
> 
> On website: https://wiki.postgresql.org/wiki/Todo#libpq
> I found that in the libpq module, there is a TODO case:
> ---
> Consider disallowing multiple queries in PQexec() as an additional barrier to 
> SQL injection attacks
> ---
> I am interested in this one. So, has it been fixed?
> If not, I am willing to do so.
> In manual, I found that:
> -
> Unlike PQexec, PQexecParams allows at most one SQL command in the given 
> string. (There can be
> semicolons in it, but not more than one nonempty command.) This is a 
> limitation of the underlying
> protocol, but has some usefulness as an extra defense against SQL-injection 
> attacks.
> 
> ---
> Maybe we can fix PQexec() just like PQexecParams()?
> 
> I will try to fix it~

I don't oppose that, but as discussed in the thread linked from there [1],
psql already has features that send multiple statements in one
PQexec() in two ways. Fixing it means making those features
obsolete.

psql db -c 'select 1; select 1;'

bash> psql db
db=> select 1\; select 1;


I couldn't find any documentation about this behavior.

[1] https://www.postgresql.org/message-id/9236.1167968...@sss.pgh.pa.us
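
As a concrete illustration, here is a minimal standalone libpq client (only a
sketch; the connection string is a placeholder): PQexec() runs both commands of
a multi-statement string, while PQexecParams() rejects the same string because
the extended protocol allows at most one command.

/* Minimal sketch: multi-statement strings with PQexec() vs PQexecParams().
 * The connection string is a placeholder; build with -lpq. */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=postgres");
    const char *sql = "select 1; select 2;";
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* PQexec() sends the whole string: both commands are executed and
     * only the last command's result is returned here. */
    res = PQexec(conn, sql);
    printf("PQexec: %s\n", PQresStatus(PQresultStatus(res)));
    PQclear(res);

    /* PQexecParams() uses the extended protocol, which allows at most
     * one command, so the same string fails. */
    res = PQexecParams(conn, sql, 0, NULL, NULL, NULL, NULL, 0);
    printf("PQexecParams: %s\n", PQresStatus(PQresultStatus(res)));
    PQclear(res);

    PQfinish(conn);
    return 0;
}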

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: extensions are hitting the ceiling

2019-03-19 Thread Jiří Fejfar
Hi all!

We are also facing some issues when using extensions. We are using
them quite intensively as a tool for maintaining our custom "DB
applications" with versioning, all tables, data, regression tests...
We find extensions great! We do not need another tool like Flyway. My
colleague already posted a report to the bugs mailing list [1] but got
no response.

Our observations correspond well with your outline:

#1: Dependencies

* It is not possible to specify the version of an extension we
depend on in the .control file.

#2: Data in Extensions

I am not sure about the name "Configuration" Tables. From my point of
view extensions can hold two sorts of data:
1) "static" data: delivered with the extension, inserted by update
scripts; the same "static" data are present across multiple
installations of the extension in the same version. These data are not
supposed to be dumped.
2) "dynamic" data: inserted by users, have to be included in dumps,
are marked with pg_extension_config_dump and are called
"configuration" tables/data ... but why "configuration"?

#3: pg_dump and Extensions

* We have described some behaviors of pg_dump which we believe are in
fact bugs: see [1] "1) pg_dump with --schema parameter" and "2)
Hanging OID in extconfig".
* Maybe it would be good to introduce a new switch, pg_dump --extension
extA, dumping all "dynamic" data from extension tables regardless of
schema

#4: Extension owned

* It is not possible to alter an extension's owner

Thanks, Jiří & Ivo.

[1] https://www.postgresql.org/message-id/15616-260dc9cb3bec7...@postgresql.org



Re: Offline enabling/disabling of data checksums

2019-03-19 Thread Fabien COELHO


Bonjour Michaël,


Please find attached an updated patch set, I have rebased that stuff
on top of my recent commits to refactor the control file updates.


Patch applies cleanly, compiles, make check-world seems ok, doc build ok.

It would help if the patch includes a version number. I assume that this 
is v7.


Doc looks ok.

Moving the controlfile looks like an effective way to prevent any 
concurrent start, as the fs operation is probably atomic and especially if 
external tools use the same trick. However this is not the case yet, eg 
"pg_resetwal" uses a "postmaster.pid" hack instead. Probably the method 
could be unified, possibly with some functions in "controlfile_utils.c".


However, I think that there still is a race condition because of the order 
in which it is implemented:


 pg_checksums reads control file
 pg_checksums checks control file contents...
 ** cluster may be started and the control file updated
 pg_checksums moves the (updated) control file
 pg_checksums proceeds on a running cluster
 pg_checksums moves back the control file
 pg_checksums updates the control file contents, overriding updates

I think that the correct way to handle this for enable/disable is:

 pg_checksums moves the control file
 pg_checksums reads, checks, proceeds, updates
 pg_checksums moves back the control file

This probably means extending a little bit the update_controlfile function 
to allow a suffix. No big deal.
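
To make the intended ordering concrete, here is a rough standalone sketch using
plain POSIX calls; the paths and the ".pg_checksums" suffix are illustrative
only, not the actual pg_checksums code:

/* Sketch of the "rename first, then operate" ordering discussed above. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    const char *ctrl = "global/pg_control";
    const char *tmp = "global/pg_control.pg_checksums";

    /* Take the control file away before reading it, so a concurrently
     * started postmaster fails immediately. */
    if (rename(ctrl, tmp) != 0)
    {
        fprintf(stderr, "rename failed: %s\n", strerror(errno));
        return 1;
    }

    /* ... read and check the control file from tmp, enable or disable
     * checksums on the data files, update the control file contents ... */

    /* Put the (updated) control file back in place. */
    if (rename(tmp, ctrl) != 0)
    {
        fprintf(stderr, "rename back failed: %s\n", strerror(errno));
        return 1;
    }
    return 0;
}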


Ok, this might not work, because of the following, less likely, race 
condition:


 postmaster opens control file RW
 pg_checksums moves control file, postmaster open file handle follows
 ...

So ISTM that we really need some locking to have something clean.

Why not always use "pgrename" instead of the strange pg_mv_file macro?

Help line about --check not simplified as suggested in a prior review, 
although you said you would take it into account.


Tests look ok.

--
Fabien.

Contribution to Perldoc for TestLib module in Postgres

2019-03-19 Thread Prajwal A V
Hi Hackers,

I was going through the regression testing framework used in postgres. I
was trying to understand the custom functions written in perl for postgres.

I could not find perldoc for the TestLib module in src/test/perl and had
some difficulty understanding what each function does, while other
modules in the src/test folder have perldoc and their functions are easy
to understand.

I would like to contribute perldoc for TestLib. I am looking for
your suggestions on whether this contribution is worth doing.

Regards,
Prajwal


Re: unconstify equivalent for volatile

2019-03-19 Thread Peter Eisentraut
On 2019-02-18 16:42, Andres Freund wrote:
> On February 18, 2019 7:39:25 AM PST, Peter Eisentraut 
>  wrote:
>> I propose to add an equivalent to unconstify() for volatile.  When
>> working on this, I picked the name unvolatize() mostly as a joke, but
>> it
>> appears it's a real word.  Other ideas?
> 
> Shouldn't we just remove just about all those use of volatile? Basically the 
> only valid use these days is on sigsetjmp call sites.

So, after some recent cleanups and another one attached here, we're now
down to 1.5 uses of this.  (0.5 because the hunk in pmsignal.c doesn't
have a cast right now, but it would need one if s/MemSet/memset/.)
Those seem legitimate.
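
As a self-contained illustration of how the compile-time check works, here is
an approximate, GCC-only re-implementation with a made-up struct type; the real
definition lives in src/include/c.h and differs slightly:

/* Approximate re-implementation of unvolatize() for illustration only. */
#include <string.h>

#define unvolatize(underlying_type, expr) \
    (((void) sizeof(struct { int not_compatible : __builtin_types_compatible_p( \
        __typeof__(expr), volatile underlying_type) ? 1 : -1; })), \
     (underlying_type) (expr))

typedef struct DemoStatus { int st_procpid; } DemoStatus;   /* made-up type */

int
main(void)
{
    volatile DemoStatus shared = {0};
    volatile DemoStatus *beentry = &shared;
    DemoStatus  local;

    /* OK: only the volatile qualifier is cast away. */
    memcpy(&local, unvolatize(DemoStatus *, beentry), sizeof(local));

    /* Would fail to compile, because the underlying type changes:
     * memcpy(&local, unvolatize(char *, beentry), sizeof(local));
     */
    return local.st_procpid;
}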

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 89aada5bdc9ef5c0b0125a72eb9d5494bf360282 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 15 Mar 2019 08:46:06 +0100
Subject: [PATCH v2 01/10] Initialize structure at declaration

Avoids extra memset call and cast.
---
 contrib/dblink/dblink.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index d95e6bfa71..d35e5ba3d8 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -982,13 +982,11 @@ materializeQueryResult(FunctionCallInfo fcinfo,
 {
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
PGresult   *volatile res = NULL;
-   volatile storeInfo sinfo;
+   volatile storeInfo sinfo = {0};
 
/* prepTuplestoreResult must have been called previously */
Assert(rsinfo->returnMode == SFRM_Materialize);
 
-   /* initialize storeInfo to empty */
-   memset((void *) &sinfo, 0, sizeof(sinfo));
sinfo.fcinfo = fcinfo;
 
PG_TRY();

base-commit: 53680c116ce8c501e4081332d32ba0e93aa1aaa2
-- 
2.21.0

From 81a4e16fd06e8571d08b1d564e0c6ba1cb647476 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Mon, 18 Feb 2019 16:08:41 +0100
Subject: [PATCH v2 02/10] Add macro to cast away volatile without allowing
 changes to underlying type

This adds unvolatize(), which works just like unconstify() but for volatile.
---
 src/backend/postmaster/pgstat.c| 2 +-
 src/backend/storage/ipc/pmsignal.c | 2 +-
 src/include/c.h| 8 +++-
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 2fbfadd9f0..2a8472b91a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3311,7 +3311,7 @@ pgstat_read_current_status(void)
localentry->backendStatus.st_procpid = 
beentry->st_procpid;
if (localentry->backendStatus.st_procpid > 0)
{
-   memcpy(&localentry->backendStatus, (char *) 
beentry, sizeof(PgBackendStatus));
+   memcpy(&localentry->backendStatus, 
unvolatize(PgBackendStatus *, beentry), sizeof(PgBackendStatus));
 
/*
 * strcpy is safe even if the string is 
modified concurrently,
diff --git a/src/backend/storage/ipc/pmsignal.c 
b/src/backend/storage/ipc/pmsignal.c
index d707993bf6..48f4311464 100644
--- a/src/backend/storage/ipc/pmsignal.c
+++ b/src/backend/storage/ipc/pmsignal.c
@@ -134,7 +134,7 @@ PMSignalShmemInit(void)
 
if (!found)
{
-   MemSet(PMSignalState, 0, PMSignalShmemSize());
+   MemSet(unvolatize(PMSignalData *, PMSignalState), 0, 
PMSignalShmemSize());
PMSignalState->num_child_flags = MaxLivePostmasterChildren();
}
 }
diff --git a/src/include/c.h b/src/include/c.h
index 658be50e0d..33c9518195 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -1122,7 +1122,7 @@ typedef union PGAlignedXLogBlock
 #endif
 
 /*
- * Macro that allows to cast constness away from an expression, but doesn't
+ * Macro that allows to cast constness and volatile away from an expression, 
but doesn't
  * allow changing the underlying type.  Enforcement of the latter
  * currently only works for gcc like compilers.
  *
@@ -1141,9 +1141,15 @@ typedef union PGAlignedXLogBlock
(StaticAssertExpr(__builtin_types_compatible_p(__typeof(expr), const 
underlying_type), \
  "wrong cast"), \
 (underlying_type) (expr))
+#define unvolatize(underlying_type, expr) \
+   (StaticAssertExpr(__builtin_types_compatible_p(__typeof(expr), volatile 
underlying_type), \
+ "wrong cast"), \
+(underlying_type) (expr))
 #else
 #define unconstify(underlying_type, expr) \
((underlying_type) (expr))
+#define unvolatize(underlying_type, expr) \
+   ((underlying_type) (expr))
 #endif
 
 /* 
-- 
2.21.0



Re: ALTER TABLE on system catalogs

2019-03-19 Thread Kyotaro HORIGUCHI
Sorry overlooked this.

At Thu, 14 Feb 2019 16:35:45 +0100, Peter Eisentraut 
 wrote in 
<84c6bcc4-026f-a44f-5726-e8035f4d1...@2ndquadrant.com>
> On 08/02/2019 04:04, Kyotaro HORIGUCHI wrote:
> > By the way, I'm confused to see that attributes that don't want
> > to go external are marked as 'x' in system catalogs. Currently
> > (putting aside its necessity) the following operation ends with
> > successful attaching a new TOAST relation, which we really don't
> > want.
> > 
> > ALTER TABLE pg_attribute ALTER COLUMN attrelid SET STORAGE plain;
> > 
> > Might be silly, but couldn't we have another storage class? Say,
> > Compression, which means try compress but don't go external.
> 
> That already exists: 'm': Value can be stored compressed inline

It works differently. 'm' doesn't prevent a toast table from
being created, and 'c' does. But I agree that your patch fixes
that. My point was just about seeming consistency in a narrow area.

> I agree that it seems we should be using that for those tables that
> don't have a toast table.  Maybe the genbki stuff could do it
> automatically for the appropriate catalogs.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: speeding up planning with partitions

2019-03-19 Thread Imai, Yoshikazu
Amit-san,

On Tue, Mar 19, 2019 at 6:53 AM, Amit Langote wrote:
> On 2019/03/15 9:33, Imai, Yoshikazu wrote:
> > On Thu, Mar 14, 2019 at 9:04 AM, Amit Langote wrote:
> >>> * In inheritance_planner(), we do ChangeVarNodes() only for
> >>> orig_rtable,
> >> so the codes concatenating lists of append_rel_list may be able to
> be
> >> moved before doing ChangeVarNodes() and then the codes concatenating
> >> lists of rowmarks, rtable and append_rel_list can be in one block (or
> >> one bunch).
> >>
> >> Yeah, perhaps.  I'll check.
> 
> I'm inclined to add source_appinfos to subroot->append_rel_list after
> finishing the ChangeVarNodes(subroot->append_rel_list) step, because if
> there are many entries in source_appinfos that would unnecessarily make
> ChangeVarNodes take longer.

OK, thanks.

> I've attached updated patches.  In the new version, I've moved some code
> from 0004 to 0005 patch, so as to avoid mixing unrelated modifications
> in one patch.  Especially, orig_rtable now only appears after applying
> 0005.
> 
> I appreciate your continued interest in these patches.

Thanks for the new patches.
I looked over them and have a few small comments.

[0002]
* s/regresion/regression/g
(in commit message.)

[0003]
* I thought "inh flag is it" is "inh flag is set" ...?

+* For RTE_RELATION rangetable entries whose inh flag is it, adjust the


* The below comments are correct only after applying 0004.

+* the query's target relation and no other.  Especially, children of 
any
+* source relations are added when the loop below calls grouping_planner
+* on the *1st* child target relation.

[0004]
* s/contain contain/contain/

+* will contain contain references to the subquery RTEs that 
we've


* s/find them children/find their children/

+* AppendRelInfos needed to find them children, so the 
next

[0006]
* s/recursivly/recursively/
(in commit message)


I have no more comments about the code other than the above :)

--
Yoshikazu Imai


Re: [HACKERS] Custom compression methods

2019-03-19 Thread Tomas Vondra


On 3/19/19 10:59 AM, Chris Travers wrote:
> 
> 
> On Mon, Mar 18, 2019 at 11:09 PM Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
> 
> 
> 
> On 3/15/19 12:52 PM, Ildus Kurbangaliev wrote:
> > On Fri, 15 Mar 2019 14:07:14 +0400
> > David Steele <da...@pgmasters.net> wrote:
> >
> >> On 3/7/19 11:50 AM, Alexander Korotkov wrote:
> >>> On Thu, Mar 7, 2019 at 10:43 AM David Steele
> >>> <da...@pgmasters.net> wrote:
> >>>
> >>>     On 2/28/19 5:44 PM, Ildus Kurbangaliev wrote:
> >>>
> >>>      > there are another set of patches.
> >>>      > Only rebased to current master.
> >>>      >
> >>>      > Also I will change status on commitfest to 'Needs review'.
> >>>
> >>>     This patch has seen periodic rebases but no code review that I
> >>> can see since last January 2018.
> >>>
> >>>     As Andres noted in [1], I think that we need to decide if this
> >>> is a feature that we want rather than just continuing to push it
> >>> from CF to CF.
> >>>
> >>>
> >>> Yes.  I took a look at code of this patch.  I think it's in pretty
> >>> good shape.  But high level review/discussion is required.
> >>
> >> OK, but I think this patch can only be pushed one more time,
> maximum,
> >> before it should be rejected.
> >>
> >> Regards,
> >
> > Hi,
> > in my opinion this patch is usually skipped not because it is not
> > needed, but because of its size. It is not hard to maintain it until
> > commiters will have time for it or I will get actual response that
> > nobody is going to commit it.
> >
> 
> That may be one of the reasons, yes. But there are other reasons, which
> I think may be playing a bigger role.
> 
> There's one practical issue with how the patch is structured - the docs
> and tests are in separate patches towards the end of the patch series,
> which makes it impossible to commit the preceding parts. This needs to
> change. Otherwise the patch size kills the patch as a whole.
> 
> But there's a more important cost/benefit issue, I think. When I look at
> patches as a committer, I naturally have to weight how much time I spend
> on getting it in (and then dealing with fallout from bugs etc) vs. what
> I get in return (measured in benefits for community, users). This patch
> is pretty large and complex, so the "costs" are quite high, while the
> benefits from the patch itself is the ability to pick between pg_lz and
> zlib. Which is not great, and so people tend to pick other patches.
> 
> Now, I understand there's a lot of potential benefits further down the
> line, like column-level compression (which I think is the main goal
> here). But that's not included in the patch, so the gains are somewhat
> far in the future.
> 
> 
> Not discussing whether any particular committer should pick this up but
> I want to discuss an important use case we have at Adjust for this sort
> of patch.
> 
> The PostgreSQL compression strategy is something we find inadequate for
> at least one of our large deployments (a large debug log spanning
> 10PB+).  Our current solution is to set storage so that it does not
> compress and then run on ZFS to get compression speedups on spinning disks.
> 
> But running PostgreSQL on ZFS has some annoying costs because we have
> copy-on-write on copy-on-write, and when you add file fragmentation... I
> would really like to be able to get away from having to do ZFS as an
> underlying filesystem.  While we have good write throughput, read
> throughput is not as good as I would like.
> 
> An approach that would give us better row-level compression  would allow
> us to ditch the COW filesystem under PostgreSQL approach.
> 
> So I think the benefits are actually quite high particularly for those
> dealing with volume/variety problems where things like JSONB might be a
> go-to solution.  Similarly I could totally see having systems which
> handle large amounts of specialized text having extensions for dealing
> with these.
> 

Sure, I don't disagree - the proposed compression approach may be a big
win for some deployments further down the road, no doubt about it. But
as I said, it's unclear when we get there (or if the interesting stuff
will be in some sort of extension, which I don't oppose in principle).

> 
> But hey, I think there are committers working for postgrespro, who might
> have the motivation to get this over the line. Of course, assuming that
> there are no serious objections to having this functionality or how it's
> implemented ... But I don't think that was the case.
> 
> 
> While I am not currently able to speak for questions of how it is
> implemented, I can say with very little doubt that we would almost
> certainly use this functionality if it were there and I c

Re: [HACKERS] Block level parallel vacuum

2019-03-19 Thread Masahiko Sawada
On Tue, Mar 19, 2019 at 7:15 PM Kyotaro HORIGUCHI
 wrote:
>
> At Tue, 19 Mar 2019 19:01:06 +0900, Masahiko Sawada  
> wrote in 
> > On Tue, Mar 19, 2019 at 4:59 PM Kyotaro HORIGUCHI
> >  wrote:
> > >
> > > At Tue, 19 Mar 2019 13:31:04 +0900, Masahiko Sawada 
> > >  wrote in 
> > > 
> > > > > For indexes=4,8,16, the cases with parallel_degree=4,8,16 behave
> > > > > almost the same. I suspect that the indexes are too-small and all
> > > > > the index pages were on memory and CPU is saturated. Maybe you
> > > > > had four cores and parallel workers more than the number had no
> > > > > effect.  Other normal backends should have been able do almost
> > > > > nothing meanwhile. Usually the number of parallel workers is
> > > > > determined so that IO capacity is filled up but this feature
> > > > > intermittently saturates CPU capacity very under such a
> > > > > situation.
> > > > >
> > > >
> > > > I'm sorry I didn't make it clear enough. If the parallel degree is
> > > > higher than 'the number of indexes - 1' redundant workers are not
> > > > launched. So for indexes=4, 8, 16 the number of actually launched
> > > > parallel workers is up to 3, 7, 15 respectively. That's why the result
> > > > shows almost the same execution time in the cases where nindexes <=
> > > > parallel_degree.
> > >
> > > In the 16 indexes case, the performance saturated at 4 workers
> > > which contradicts to your explanation.
> >
> > Because the machine I used has 4 cores the performance doesn't get
> > improved even if more than 4 parallel workers are launched.
>
> That is what I mentioned in the cited phrases. Sorry for perhaps
> hard-to-read phrases..

I understood now. Thank you!


Attached are the updated version patches, which incorporate all review
comments.

Commit 6776142 changed the grammar production of the VACUUM command. This
patch adds the PARALLEL option on top of that commit.

I realized that commit 6776142 breaks indentation in ExecVacuum() and that
including nodes/parsenodes.h is no longer needed. Sorry, that's my fault.
The attached patch (vacuum_fix.patch) fixes them, although the indentation
issue would be resolved by pgindent before release anyway.

When parsing the VACUUM command, since only the PARALLEL option can have an
argument, I've added a check in ExecVacuum to error out when other options
have an argument. But it might be good to make other vacuum options
(perhaps except for the DISABLE_PAGE_SKIPPING option) accept an argument,
just like the EXPLAIN command.
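
For illustration only, such a check could look roughly like the fragment below;
this is a sketch of the idea, not the code in the attached patch:

/* Rough fragment (not the attached patch): while walking the DefElem list
 * in ExecVacuum(), accept an argument only for the PARALLEL option. */
ListCell   *lc;
int         nworkers = 0;

foreach(lc, vacstmt->options)
{
    DefElem    *opt = (DefElem *) lfirst(lc);

    if (strcmp(opt->defname, "parallel") == 0)
        nworkers = defGetInt32(opt);    /* PARALLEL takes an integer */
    else if (opt->arg != NULL)
        ereport(ERROR,
                (errcode(ERRCODE_SYNTAX_ERROR),
                 errmsg("option \"%s\" does not accept an argument",
                        opt->defname)));
    /* ... boolean options (FREEZE, VERBOSE, ...) are handled as before ... */
}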

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


vacuum_fix.patch
Description: Binary data


v18-0001-Add-parallel-option-to-VACUUM-command.patch
Description: Binary data


v18-0002-Add-paralell-P-option-to-vacuumdb-command.patch
Description: Binary data


Re: Compressed TOAST Slicing

2019-03-19 Thread Stephen Frost
Greetings,

* Paul Ramsey (pram...@cleverelephant.ca) wrote:
> > On Mar 18, 2019, at 7:34 AM, Robert Haas  wrote:
> > +1.  I think Paul had it right originally.
> 
> In that spirit, here is a “one pglz_decompress function, new parameter” 
> version for commit.

Alright, I've been working through this and have made a few improvements
(the big comment block at the top of pg_lzcompress.c needed updating,
among a couple other minor things), but I was trying to wrap my head
around this:

+/* --
+ * toast_decompress_datum_slice -
+ *
+ * Decompress the front of a compressed version of a varlena datum.
+ * offset handling happens in heap_tuple_untoast_attr_slice.
+ * Here we just decompress a slice from the front.
+ */
+static struct varlena *
+toast_decompress_datum_slice(struct varlena *attr, int32 slicelength)
+{
+   struct varlena *result;
+   int32 rawsize;
+
+   Assert(VARATT_IS_COMPRESSED(attr));
+
+   result = (struct varlena *) palloc(slicelength + VARHDRSZ);
+   SET_VARSIZE(result, slicelength + VARHDRSZ);
+
+   rawsize = pglz_decompress(TOAST_COMPRESS_RAWDATA(attr),
+   VARSIZE(attr) - 
TOAST_COMPRESS_HDRSZ,
+   VARDATA(result),
+   slicelength, false);
+   if (rawsize < 0)
elog(ERROR, "compressed data is corrupted");
 
+   SET_VARSIZE(result, rawsize + VARHDRSZ);
return result;
 }

Specifically, the two SET_VARSIZE() calls, do we really need both..?
Are we sure that we're setting the length correctly there..?  Is there
any cross-check we can do?
 
I have to admit that I find the new argument to pglz_decompress() a bit
awkward to describe and document; if you have any thoughts as to how
that could be improved, that'd be great.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: partitioned tables referenced by FKs

2019-03-19 Thread Alvaro Herrera
On 2019-Mar-19, Amit Langote wrote:

> Will it suffice or be OK if we skipped invoking the pre-drop callback for
> objects that are being "indirectly" dropped?  I came up with the attached
> patch to resolve this problem, if that idea has any merit.  We also get
> slightly better error message as seen the expected/foreign_key.out changes.

I don't think this works ... consider putting some partition in a
different schema and then doing DROP SCHEMA CASCADE.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Contribution to Perldoc for TestLib module in Postgres

2019-03-19 Thread Alvaro Herrera
On 2019-Mar-19, Prajwal A V wrote:

> I could not find the perldoc for TestLib module in src/test/perl and found
> some difficultly in understanding what each function does while other
> modules in the src/test folder had perldoc and it was easy to understand
> the functions.
> 
> I would like to contribute for the perldoc for TestLib. I am looking for
> your suggestions if this contribution is worth doing.

Yes, it is, please do.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: partitioned tables referenced by FKs

2019-03-19 Thread Amit Langote
On Tue, Mar 19, 2019 at 8:49 PM Alvaro Herrera  wrote:
> On 2019-Mar-19, Amit Langote wrote:
>
> > Will it suffice or be OK if we skipped invoking the pre-drop callback for
> > objects that are being "indirectly" dropped?  I came up with the attached
> > patch to resolve this problem, if that idea has any merit.  We also get
> > slightly better error message as seen the expected/foreign_key.out changes.
>
> I don't think this works ... consider putting some partition in a
> different schema and then doing DROP SCHEMA CASCADE.

Ah, I did test DROP SCHEMA CASCADE, but only tested putting the top
table into the schema, not a partition.

Thanks,
Amit



Re: partitioned tables referenced by FKs

2019-03-19 Thread Alvaro Herrera
On 2019-Mar-19, Amit Langote wrote:

> On Tue, Mar 19, 2019 at 8:49 PM Alvaro Herrera  
> wrote:
> > On 2019-Mar-19, Amit Langote wrote:
> >
> > > Will it suffice or be OK if we skipped invoking the pre-drop callback for
> > > objects that are being "indirectly" dropped?  I came up with the attached
> > > patch to resolve this problem, if that idea has any merit.  We also get
> > > slightly better error message as seen the expected/foreign_key.out 
> > > changes.
> >
> > I don't think this works ... consider putting some partition in a
> > different schema and then doing DROP SCHEMA CASCADE.
> 
> Ah, I did test DROP SCHEMA CASCADE, but only tested putting the top
> table into the schema, not a partition.

:-(

I think this is fixable by using a two-step strategy where we first
compile a list of constraints being dropped, and in the second step we
know to ignore those when checking partitions that are being dropped.
Now, maybe the order of objects visited guarantees that the constraint
is seen (by drop) before each corresponding partition, and in that case
we don't need two steps, just one.  I'll have a look at that.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Offline enabling/disabling of data checksums

2019-03-19 Thread Michael Paquier
On Tue, Mar 19, 2019 at 11:48:25AM +0100, Fabien COELHO wrote:
> Moving the controlfile looks like an effective way to prevent any concurrent
> start, as the fs operation is probably atomic and especially if external
> tools uses the same trick. However this is not the case yet, eg
> "pg_resetwal" uses a "postmaster.pid" hack instead.

pg_upgrade does so.  Note that pg_resetwal does not check either that
the PID in the file is actually running.

> Probably the method could be unified, possibly with some functions
> in "controlfile_utils.c".

Hm.  postmaster.pid is just here to make sure that the instance is not
started at all, while we require the instance to be stopped cleanly
with other tools, so that's not really consistent in my opinion to
combine both.

> Ok, this might not work, because of the following, less likely, race
> condition:
> 
>  postmaster opens control file RW
>  pg_checksums moves control file, postmaster open file handle follows
>  ...
> 
> So ISTM that we really need some locking to have something clean.

We are talking about complicating a method which is already fine for a
window where the whole operation works, as it could take hours to
enable checksums, versus a couple of instructions.  I am not sure that
it is worth complicating the code more.

> Help line about --check not simplified as suggested in a prior review,
> although you said you would take it into account.

Oops, it looks like this got lost because of the successive rebases.
I am sure I updated it at some point.  Anyway, thanks for
pointing it out, I got that fixed on my local branch.
--
Michael


signature.asc
Description: PGP signature


Re: Contribution to Perldoc for TestLib module in Postgres

2019-03-19 Thread Michael Paquier
On Tue, Mar 19, 2019 at 09:05:29AM -0300, Alvaro Herrera wrote:
> Yes, it is, please do.

+1.
--
Michael


signature.asc
Description: PGP signature


Re: Fwd: Add tablespace tap test to pg_rewind

2019-03-19 Thread Shaoqi Bai
On Tue, Mar 12, 2019 at 10:27 AM Michael Paquier 
wrote:

> On Mon, Mar 11, 2019 at 07:49:11PM +0800, Shaoqi Bai wrote:
> > Thanks, will work on it as you suggested
> > Add pg_basebackup --T olddir=newdir to support check the consistency of a
> > tablespace created before promotion
> > Add run_test('remote');
>
> Thanks for considering my input.  Why don't you register your patch to
> the next commit fest then so as it goes through a formal review once
> you are able to provide a new version?  The commit fest is here:
> https://commitfest.postgresql.org/23/
>
> We are currently in the process of wrapping up the last commit fest
> for v12, so this stuff will have to wait a bit :(
>
> It could be an idea to split the patch in two pieces:
> - One patch which refactors the code for the new option in
> PostgresNode.pm
> - Second patch for the new test with integration in RewindTest.pm.
> This should touch different parts of the code, so combining both would
> be fine as well for me :)
>

Thanks for your advice, and sorry for taking so long to give an update in
the thread; I got stuck modifying the Perl script, knowing little about the
Perl language.

I tried to do the following two things in this patch.
1. Add pg_basebackup --T olddir=newdir to support checking the consistency of
a tablespace created before promotion.
2. Run both run_test('local') and run_test('remote');

The code can pass make installcheck under src/bin/pg_rewind, but cannot
pass make installcheck under src/bin/pg_basebackup, because the patch's
refactoring is not done well.

The patch still needs refactoring to make other tests pass, like the tests
under src/bin/pg_basebackup.
I am sending this mail just to let you know of the small progress on the
thread.


0001-Add-tablespace-tap-test-for-pg_rewind.patch
Description: Binary data


Re: Offline enabling/disabling of data checksums

2019-03-19 Thread Michael Banck
Hi,

Am Dienstag, den 19.03.2019, 11:48 +0100 schrieb Fabien COELHO:
> Moving the controlfile looks like an effective way to prevent any 
> concurrent start, as the fs operation is probably atomic and especially if 
> external tools uses the same trick. However this is not the case yet, eg 
> "pg_resetwal" uses a "postmaster.pid" hack instead. Probably the method 
> could be unified, possibly with some functions in "controlfile_utils.c".
> 
> However, I think that there still is a race condition because of the order 
> in which it is implemented:
> 
>   pg_checksums reads control file
>   pg_checksums checks control file contents...
>   ** cluster may be started and the control file updated
>   pg_checksums moves the (updated) control file
>   pg_checksums proceeds on a running cluster
>   pg_checksums moves back the control file
>   pg_checksums updates the control file contents, overriding updates
> 
> I think that the correct way to handle this for enable/disable is:
> 
>   pg_checksums moves the control file
>   pg_checksums reads, checks, proceeds, updates
>   pg_checksums moves back the control file
> 
> This probably means extending a little bit the update_controlfile function 
> to allow a suffix. No big deal.
> 
> Ok, this might not work, because of the following, less likely, race 
> condition:
> 
>   postmaster opens control file RW
>   pg_checksums moves control file, postmaster open file handle follows
>   ...
> 
> So ISTM that we really need some locking to have something clean.

I think starting the postmaster during offline maintenance is already
quite some pilot error. As pg_checksums can potentially run for hours
though, I agree it is important to disable the cluster in the meantime.

There's really not a lot going on between pg_checksums reading the
control file and moving it away - what you propose above sounds a bit
like overengineering to me.

If anything, we could include the postmaster.pid check from pg_resetwal
after we have renamed the control file to make absolutely sure that the
cluster is offline. Once the control file is gone and there is no
postmaster.pid, it surely cannot be pg_checksums' problem anymore if a
postmaster is started regardless of maintenance.

I leave that to Michael to decide whether he thinks the above is
warranted.

I think the more important open issue is what to do about PITR and
streaming replication, see my replies to Magnus elsewhere in the thread.

> Why not always use "pgrename" instead of the strange pg_mv_file macro?

pg_upgrade does it the same way, possibly both could be converted to
pgrename, dunno.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: Offline enabling/disabling of data checksums

2019-03-19 Thread Fabien COELHO




Ok, this might not work, because of the following, less likely, race
condition:

 postmaster opens control file RW
 pg_checksums moves control file, postmaster open file handle follows
 ...

So ISTM that we really need some locking to have something clean.


We are talking about complicating a method which is already fine for a
a window where the whole operation works, as it could take hours to
enable checksums, versus a couple of instructions.  I am not sure that
it is worth complicating the code more.


Hmmm. Possibly. The point is that anything only needs to be implemented 
once. The whole point of pg is to have ACID transactional properties, but 
it does not achieve that on the controlfile, which I find paradoxical:-)


Now if there is a race condition opportunity, ISTM that it should be as
short as possible. Renaming before manipulating seems safer if other
commands proceed like that as well. Probably if pg always renamed *THEN*
opened before doing anything, in all commands, it could be safe, provided
that the renaming is atomic, which I think is the case.


That would avoid locking, at the price of a small probability of having a 
controlfile in a controlfile.command-that-failed-at-the-wrong-time state. 
Maybe it is okay. Maybe there is a need to be able to force the
state back to something to recover from such an unlikely event, but probably
that already exists (eg postmaster could be dead without releasing the
controlfile state).


--
Fabien.



Re: pg_upgrade version checking questions

2019-03-19 Thread Daniel Gustafsson
On Tuesday, March 19, 2019 7:55 AM, Bruce Momjian  wrote:

> On Tue, Mar 19, 2019 at 02:43:49AM -0400, Bruce Momjian wrote:
>
> > > 3.  Actually, I'm kind of wondering why pg_upgrade has a --new-bindir
> > > option at all, rather than just insisting on finding the new-version
> > > executables in the same directory it is in. This seems like, at best,
> > > a hangover from before it got into core. Even if you don't want to
> > > remove the option, we could surely provide a useful default setting
> > > based on find_my_exec. (I'm amused to notice that pg_upgrade
> > > currently takes the trouble to find out its own path, and then does
> > > precisely nothing with the information.)
> > >
> >
> > Good point. You are right that when it was outside of the source tree,
> > and even in /contrib, that would not have worked easily. Makes sense to
> > at least default to the same directory as pg_upgrade.
>
> I guess an open question is whether we should remove the --new-bindir
> option completely.

If the default is made to find the new-version binaries in the same directory,
keeping --new-bindir could still be useful for easier testing of pg_upgrade.

cheers ./daniel



Re: Index Skip Scan

2019-03-19 Thread Dmitry Dolgov
> On Sat, Mar 16, 2019 at 5:14 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Fri, Mar 15, 2019 at 4:55 AM Kyotaro HORIGUCHI 
> >  wrote:
> > I have some comments on the latest v11 patch.
>
> Thank you!

In the meantime here is a new version, rebased after tableam changes.


v13-0001-Index-skip-scan.patch
Description: Binary data


Re: partitioned tables referenced by FKs

2019-03-19 Thread Alvaro Herrera
On 2019-Mar-14, Robert Haas wrote:

> On Thu, Mar 14, 2019 at 3:36 PM Alvaro Herrera  
> wrote:

> > In any case, since the RI
> > queries are run via SPI, any unnecessary partitions should get pruned by
> > partition pruning based on each partition's constraint.  So I'm not so
> > certain that this is a problem worth spending much time/effort/risk of
> > bugs on.
> 
> I doubt that partition pruning would just automatically DTRT in a case
> like this, but if I'm wrong, great!

Uh, it was you that came up with the partition pruning system, so of
course it DsTRT.

I did test (some of?) the queries issued by ri_triggers with partitioned
tables on the PK side, and indeed it seemed to me that partitions were
being pruned at the right times.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-03-19 Thread Maksim Milyutin

On 3/16/19 5:32 PM, Robert Haas wrote:


There's only one query ID field available, and
you can't use two extensions that care about query ID unless they
compute it the same way, and replicating all the code that computes
the query ID into each new extension that wants one sucks.  I think we
should actually bite the bullet and move all of that code into core,
and then just let extensions say whether they care about it getting
set.



+1.

But I think that it is enough to integrate into core the query normalization
routine and to store generalized query strings (from which the queryId is
produced) in shared memory (for example, a hashtable that maps queryId to
the text representation of the generalized query), and to activate the
normalization routine and the filling of the table of generalized queries
via a dedicated GUC.


This would allow extensions that require queryId to be unbound from
pg_stat_statements, and such computation of queryId could be considered
canonical.



--
Regards,
Maksim Milyutin




Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread Tom Lane
Kyotaro HORIGUCHI  writes:
> At Tue, 19 Mar 2019 08:18:23 +, "Wu, Fei"  
> wrote in 
> <52E6E0843B9D774C8C73D6CF64402F05621F0FFC@G08CNEXMBPEKD02.g08.fujitsu.local>
>> I will try to fix it~

> I don't oppose that, but as the discussion linked from there [1],
> psql already has a feature that sends multiple statements by one
> PQexec() in two ways. Fixing it means making the features
> obsolete.

Yeah, the problem here is that a lot of people think that that's
a feature not a bug.  You certainly can't get away with just summarily
changing the behavior of PQexec without any recourse.  Maybe there
would be acceptance for either of

(1) a different function that is like PQexec but restricts the
query string

(2) a connection option or state variable that affects PQexec's
behavior --- but it probably still has to default to permissive.

Unfortunately, if the default behavior doesn't change, then there's little
argument for doing this at all.  The security reasoning behind doing
anything in this area would be to provide an extra measure of protection
against SQL-injection attacks on carelessly-written clients, and of course
it's unlikely that a carelessly-written client would get changed to make
use of a non-default feature.

So that's why nothing has been done about this for umpteen years.
If somebody can think of a way to resolve this tension, maybe the
item will get finished; but it's not just a matter of writing some
code.

regards, tom lane



Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-03-19 Thread Julien Rouhaud
On Tue, Mar 19, 2019 at 2:45 PM Maksim Milyutin  wrote:
>
> But I think that enough to integrate into core the query normalization
> routine and store generalized query strings (from which the queryId is
> produced) in shared memory (for example, hashtable that maps queryId to
> the text representation of generalized query).

That's more or less how pg_stat_statements was previously behaving,
and it had too many problems.  Current implementation, with an
external file, is a better alternative.

> And activate
> normalization routine and filling the table of generalized queries by
> specified GUC.
>
> This allows to unbind extensions that require queryId from using
> pg_stat_statements and consider such computing of queryId as canonical.

The problem I see with this approach is that if you want a different
implementation, you'll have to reimplement the in-core normalised
queries saving and retrieval, but with a different set of SQL-visible
functions.  I don't think that's it's acceptable, unless we add a
specific hook for query normalisation and queryid computing.  But it
isn't ideal either, as it would be a total mess if someone changes the
implementation without resetting the previously saved normalised
queries.



Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-03-19 Thread Julien Rouhaud
On Mon, Mar 18, 2019 at 7:33 PM Julien Rouhaud  wrote:
>
> On Mon, Mar 18, 2019 at 6:23 PM Yun Li  wrote:
> >
> > Let's take one step back. Since queryId is stored in core as Julien pointed 
> > out, can we just add that global to pg_stat_get_activity and ultimately
> > expose it in the pg_stat_activity view? Then no matter whether PGSS is on or
> > off, or however the customer extensions are updating that field, we expose
> > that field in that view and enable users to leverage that id to join with
> > pgss or their extension. Does this sound like a good idea?
>
> I'd greatly welcome queryid exposure in pg_stat_activity, and
> also in log_line_prefix.  I'm afraid that it's too late for pg12
> inclusion, but I'll be happy to provide a patch for that for pg13.

Here's a prototype patch for queryid exposure in pg_stat_activity and
log_line_prefix.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d383de2512..37570825be 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6260,6 +6260,11 @@ local0.*/var/log/postgresql
  session processes
  no
 
+
+ %Q
+ queryid: identifier of session's current query, if any
+ yes
+
 
  %%
  Literal %
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index ac2721c8ad..726c9430d5 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -800,6 +800,19 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  xid
  The current backend's xmin horizon.
 
+
+ queryid
+ bigint
+ Identifier of this backend's most recent query. If
+  state is active this field
+  shows the identifier of the currently executing query. In all other
+  states, it shows the identifier of the last query that was executed, unless
+  an error occurred, which will reset this field to 0.  By default, query
+  identifiers are not computed, so this field will always display 0, unless
+  an additional module that computes query identifiers, such as , is configured.
+ 
+
 
  query
  text
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d962648bc5..6b62c7db1c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -708,6 +708,7 @@ CREATE VIEW pg_stat_activity AS
 S.state,
 S.backend_xid,
 s.backend_xmin,
+S.queryid,
 S.query,
 S.backend_type
 FROM pg_stat_get_activity(NULL) AS S
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 63a34760ee..955722d3a4 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -140,6 +140,8 @@ static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 void
 ExecutorStart(QueryDesc *queryDesc, int eflags)
 {
+	pg_atomic_write_u64(&MyProc->queryId, queryDesc->plannedstmt->queryId);
+
 	if (ExecutorStart_hook)
 		(*ExecutorStart_hook) (queryDesc, eflags);
 	else
@@ -300,6 +302,8 @@ ExecutorRun(QueryDesc *queryDesc,
 			ScanDirection direction, uint64 count,
 			bool execute_once)
 {
+	pg_atomic_write_u64(&MyProc->queryId, queryDesc->plannedstmt->queryId);
+
 	if (ExecutorRun_hook)
 		(*ExecutorRun_hook) (queryDesc, direction, count, execute_once);
 	else
@@ -399,6 +403,8 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 void
 ExecutorFinish(QueryDesc *queryDesc)
 {
+	pg_atomic_write_u64(&MyProc->queryId, queryDesc->plannedstmt->queryId);
+
 	if (ExecutorFinish_hook)
 		(*ExecutorFinish_hook) (queryDesc);
 	else
@@ -459,6 +465,8 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
 void
 ExecutorEnd(QueryDesc *queryDesc)
 {
+	pg_atomic_write_u64(&MyProc->queryId, queryDesc->plannedstmt->queryId);
+
 	if (ExecutorEnd_hook)
 		(*ExecutorEnd_hook) (queryDesc);
 	else
@@ -538,6 +546,8 @@ ExecutorRewind(QueryDesc *queryDesc)
 	/* It's probably not sensible to rescan updating queries */
 	Assert(queryDesc->operation == CMD_SELECT);
 
+	pg_atomic_write_u64(&MyProc->queryId, queryDesc->plannedstmt->queryId);
+
 	/*
 	 * Switch into per-query memory context
 	 */
diff --git a/src/backend/executor/spi.c b/src/backend/executor/spi.c
index d898f4ca78..0729c2f1a3 100644
--- a/src/backend/executor/spi.c
+++ b/src/backend/executor/spi.c
@@ -24,6 +24,7 @@
 #include "executor/executor.h"
 #include "executor/spi_priv.h"
 #include "miscadmin.h"
+#include "storage/proc.h"
 #include "tcop/pquery.h"
 #include "tcop/utility.h"
 #include "utils/builtins.h"
@@ -1879,6 +1880,7 @@ _SPI_prepare_plan(const char *src, SPIPlanPtr plan)
 	List	   *plancache_list;
 	ListCell   *list_item;
 	ErrorContextCallback spierrcontext;
+	uint64		old_queryId = pg_atomic_read_u64(&MyProc->queryId);
 
 	/*
 	 * Setup error traceback support for ereport()
@@ -1935,6 +1937,8 @@ _SPI_prepare_plan(

Re: [HACKERS] Cached plans and statement generalization

2019-03-19 Thread Konstantin Knizhnik

Thank you very much for the review!

On 19.03.2019 5:56, Yamaji, Ryo wrote:

On Tue, Jan 29, 2019 at 10:46 AM, Konstantin Knizhnik wrote:

Rebased version of the patch is attached.

I'm sorry for the late review.
I confirmed behavior of autoprepare-12.patch. It is summarized below.

・parameter
Expected behavior was shown according to the set value.
However, I think that it would be kinder to describe in the manual that
autoprepare holds an unlimited number of statements when the setting value of
autoprepare_(memory_)limit is 0.
There is no problem as far as operation goes.


Sorry, I do not completely understand your concern.
The descriptions of autoprepare_(memory_)limit include an explanation of the
zero value:


autoprepare_limit:
0 means an unlimited number of autoprepared queries. Too large a number of
prepared queries can cause backend memory overflow and slow down
execution speed (because of increased lookup time).


autoprepare_memory_limit:
0 means that there is no memory limit. Calculating the memory used by
prepared queries adds some extra overhead,
so a non-zero value of this parameter may cause some slowdown.
autoprepare_limit is a much faster way to limit the number of autoprepared
statements.



Do you think that these descriptions are unclear and should be rewritten?


・pg_autoprepared_statements
I confirmed that I could refer to it properly.

・autoprepare cache retention period
I confirmed that autoprepared statements were deleted when a SET
statement or a DDL statement was executed. Although this differs from
explicit prepared statements, it does not matter for autoprepare.

・performance
This review did not re-check the basic performance of autoprepare because
it was confirmed that there was no performance problem with the previous
patch (autoprepare-11.patch). However, because it was argued that
performance degradation would occur when prepared statements are executed
against a partitioned table, I expected that autoprepare might exhibit
similar behavior, and measured the performance. I also predicted that the
plan_cache_mode setting does not apply to autoprepare, and measured with
each plan_cache_mode setting.
Results below (the numbers are TPS):

 plan_cache_mode simple  simple(autoprepare) prepare
 auto130 121 121.5
 force_custom_plan   132.5   90.7122.7
 force_generic_plan  126.7   14.724.7

Performance degradation was observed when plan_cache_mode was specified
for autoprepare. Is this behavior correct? I do not know why I got these
results.

Below performance test procedure

drop table if exists rt;
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
  (x)::text || ') to (' || (x+1)::text || ');' from generate_series(0, 1024) x;
\gexec
\o

pgbench -p port -T 60 -c 1 -n -f test.sql (-M prepared) postgres

test.sql
\set a random (0, 1023)
select * from rt where a = :a;
Autoprepare is using the same functions from plancache.c, so
plan_cache_mode settings affect
autoprepare as well as explicitly prepared statements.

Below are my results of select-only pgbench:

plan_cache_mode simple  simple(autoprepare) prepare
auto23k 42k 50k
force_custom_plan   23k 24k 26k
force_generic_plan  23k 44k 50k


As you can see, force_custom_plan slows down both explicitly prepared and
autoprepared statements.
Unfortunately generic plans do not work well with partitioned tables
because they disable partition pruning.

On my system, the results of your query execution are the following:

plan_cache_mode simple  simple(autoprepare) prepare
auto232 220 219
force_custom_plan   234 175 211
    force_generic_plan  230 48  48


The conclusion is that forcing a generic plan can cause a slowdown of
queries on partitioned tables.
If plan cache mode is not enforced, then the standard Postgres strategy of
comparing the efficiency of generic and custom plans
works well.

Attached please find rebased version of the patch.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/autoprepare.sgml b/doc/src/sgml/autoprepare.sgml
new file mode 100644
index 000..b3309bd
--- /dev/null
+++ b/doc/src/sgml/autoprepare.sgml
@@ -0,0 +1,62 @@
+
+
+ 
+  Autoprepared statements
+
+  
+   autoprepared statements
+  
+
+  
+PostgreSQL makes it possible to prepare
+frequently used statements to eliminate the cost of their compilation
+and optimization on each execution of the query. On simple queries
+(like the ones in pgbench -S) using prepared statements
+increases performance more than two times.
+  
+
+  
+Unfortunately not all database applications are using prepared statements
+and, moreover, it is not always possible. For example, in case of us

Re: Problem with default partition pruning

2019-03-19 Thread Thibaut Madelaine


Le 19/03/2019 à 08:01, Yuzuko Hosoya a écrit :
> Hi Amit-san,
>
> From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> Sent: Monday, March 18, 2019 6:44 PM
>  
>> Hosoya-san,
>>
>> On 2019/03/15 15:05, Yuzuko Hosoya wrote:
>>> Indeed, it's problematic.  I also did test and I found that this
>>> problem was occurred when any partition didn't match WHERE clauses.
>>> So following query didn't work correctly.
>>>
>>> # explain select * from test1_3 where (id > 0 and id < 30);
>>>QUERY PLAN
>>> -
>>>  Append  (cost=0.00..58.16 rows=12 width=36)
>>>->  Seq Scan on test1_3_1  (cost=0.00..29.05 rows=6 width=36)
>>>  Filter: ((id > 0) AND (id < 30))
>>>->  Seq Scan on test1_3_2  (cost=0.00..29.05 rows=6 width=36)
>>>  Filter: ((id > 0) AND (id < 30))
>>> (5 rows)
>>>
>>> I created a new patch to handle this problem, and confirmed the query
>>> you mentioned works as expected
>>>
>>> # explain select * from test1 where (id > 0 and id < 30) or (id > 220 and 
>>> id < 230);
>>> QUERY PLAN
>>> --
>>> -  Append  (cost=0.00..70.93 rows=26 width=36)
>>>->  Seq Scan on test1_1_1  (cost=0.00..35.40 rows=13 width=36)
>>>  Filter: (((id > 0) AND (id < 30)) OR ((id > 220) AND (id < 230)))
>>>->  Seq Scan on test1_3_1  (cost=0.00..35.40 rows=13 width=36)
>>>  Filter: (((id > 0) AND (id < 30)) OR ((id > 220) AND (id <
>>> 230)))
>>> (5 rows)
>>>
>>> v2 patch attached.
>>> Could you please check it again?
>> I think the updated patch breaks the promise that get_matching_range_bounds 
>> won't set scan_default
>> based on individual pruning value comparisons.  How about the attached delta 
>> patch that applies on
>> top of your earlier v1 patch, which fixes the issue reported by Thibaut?
>>
> Indeed.  I agreed with your proposal.
> Also, I confirmed your patch works correctly.
>
> Best regards,
> Yuzuko Hosoya

I kept on testing with sub-partitioning.
I found a case, using 2 default partitions, where a default partition is
not pruned:

--

create table test2(id int, val text) partition by range (id);
create table test2_20_plus_def partition of test2 default;
create table test2_0_20 partition of test2 for values from (0) to (20)
  partition by range (id);
create table test2_0_10 partition of test2_0_20 for values from (0) to (10);
create table test2_10_20_def partition of test2_0_20 default;

# explain (costs off) select * from test2 where id=5 or id=25;
   QUERY PLAN   
-
 Append
   ->  Seq Scan on test2_0_10
 Filter: ((id = 5) OR (id = 25))
   ->  Seq Scan on test2_10_20_def
 Filter: ((id = 5) OR (id = 25))
   ->  Seq Scan on test2_20_plus_def
 Filter: ((id = 5) OR (id = 25))
(7 rows)

--

I have the same output using Amit's v1-delta.patch or Hosoya's
v2_default_partition_pruning.patch.





Re: pg_upgrade version checking questions

2019-03-19 Thread Peter Eisentraut
On 2019-03-19 00:35, Tom Lane wrote:
> 2. check_cluster_versions() insists that the target version be the
> same major version as pg_upgrade itself, but is that really good enough?
> As things stand, it looks like pg_upgrade 11.3 would happily use pg_dump
> 11.1, or vice versa.

I'd hesitate to tie this down too much.  It's possible that either the
client or the server package cannot currently be upgraded because of
some other dependencies.  In fact, a careful packager might as a result
of a change like this tie the client and server packages together with
an exact version match.  This has the potential to make the global
dependency hell worse.

> 3. Actually, I'm kind of wondering why pg_upgrade has a --new-bindir
> option at all, rather than just insisting on finding the new-version
> executables in the same directory it is in.  This seems like, at best,
> a hangover from before it got into core.  Even if you don't want to
> remove the option, we could surely provide a useful default setting
> based on find_my_exec.

Previously discussed here:
https://www.postgresql.org/message-id/flat/1304710184.28821.9.camel%40vanquo.pezone.net
 (Summary: right)

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online verification of checksums

2019-03-19 Thread Robert Haas
On Mon, Mar 18, 2019 at 2:38 AM Stephen Frost  wrote:
> Sure the backend has those facilities since it needs to, but these
> frontend tools *don't* need that to *never* have any false positives, so
> why are we complicating things by saying that this frontend tool and the
> backend have to coordinate?
>
> If there's an explanation of why we can't avoid having false positives
> in the frontend tool, I've yet to see it.  I definitely understand that
> we can get partial reads, but a partial read isn't a failure, and
> shouldn't be reported as such.

I think there's some confusion between 'partial read' and 'torn page',
as Michael also said.

It's torn pages that I am concerned about - the server is writing and
we are reading, and we get a mix of old and new content.  We have been
quite diligent about protecting ourselves from such risks elsewhere,
and checksum verification should not be held to any lesser standard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: BUG #15572: Misleading message reported by "Drop function operation" on DB with functions having same name

2019-03-19 Thread Pavel Stehule
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

I read the discussion and I think the currently implemented behavior (as of the
last patch) is correct in all details.

I propose commenting more strongly on the fact that noError is applied only to
the "not found" case. In other cases, this flag is ignored and an error is
raised immediately; I think it is not commented well enough why.
This is a significant change - in previous releases, noError really meant no
error, so it should be commented more thoroughly.

The regression tests are sufficient.
The patch applies without problems and compiles without warnings.

The new status of this patch is: Ready for Committer


Re: Offline enabling/disabling of data checksums

2019-03-19 Thread Andres Freund
Hi,

On 2019-03-18 17:13:01 +0900, Michael Paquier wrote:
> +/*
> + * Locations of persistent and temporary control files.  The control
> + * file gets renamed into a temporary location when enabling checksums
> + * to prevent a parallel startup of Postgres.
> + */
> +#define CONTROL_FILE_PATH"global/pg_control"
> +#define CONTROL_FILE_PATH_TEMP   CONTROL_FILE_PATH 
> ".pg_checksums_in_progress"

I think this should be outright rejected. Again, you're making the
control file into something it isn't. And there's no buyin for this as
far as I can tell outside of Fabien and you. For crying out loud, if the
server crashes during this YOU'VE CORRUPTED THE CLUSTER.

- Andres



Re: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-19 Thread Peter Eisentraut
On 2019-03-19 10:21, Tsunakawa, Takayuki wrote:
> From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
>> Fixed.
> 
> Rebased on HEAD.

I have committed the first patch that reorganizes the struct.  I'll have
to spend some time evaluating the performance impact of the second
patch, but it seems OK in principle.  Performance tests from others welcome.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Custom compression methods

2019-03-19 Thread Chris Travers
On Tue, Mar 19, 2019 at 12:19 PM Tomas Vondra 
wrote:

>
> On 3/19/19 10:59 AM, Chris Travers wrote:
> >
> >
> > Not discussing whether any particular committer should pick this up but
> > I want to discuss an important use case we have at Adjust for this sort
> > of patch.
> >
> > The PostgreSQL compression strategy is something we find inadequate for
> > at least one of our large deployments (a large debug log spanning
> > 10PB+).  Our current solution is to set storage so that it does not
> > compress and then run on ZFS to get compression speedups on spinning
> > disks.
> >
> > But running PostgreSQL on ZFS has some annoying costs because we have
> > copy-on-write on copy-on-write, and when you add file fragmentation... I
> > would really like to be able to get away from having to do ZFS as an
> > underlying filesystem.  While we have good write throughput, read
> > throughput is not as good as I would like.
> >
> > An approach that would give us better row-level compression  would allow
> > us to ditch the COW filesystem under PostgreSQL approach.
> >
> > So I think the benefits are actually quite high particularly for those
> > dealing with volume/variety problems where things like JSONB might be a
> > go-to solution.  Similarly I could totally see having systems which
> > handle large amounts of specialized text having extensions for dealing
> > with these.
> >
>
> Sure, I don't disagree - the proposed compression approach may be a big
> win for some deployments further down the road, no doubt about it. But
> as I said, it's unclear when we get there (or if the interesting stuff
> will be in some sort of extension, which I don't oppose in principle).
>

I would assume that if extensions are particularly stable and useful they
could be moved into core.

But I would also assume that at first, this area would be sufficiently
experimental that folks (like us) would write our own extensions for it.


>
> >
> > But hey, I think there are committers working for postgrespro, who
> might
> > have the motivation to get this over the line. Of course, assuming
> that
> > there are no serious objections to having this functionality or how
> it's
> > implemented ... But I don't think that was the case.
> >
> >
> > While I am not currently able to speak for questions of how it is
> > implemented, I can say with very little doubt that we would almost
> > certainly use this functionality if it were there and I could see plenty
> > of other cases where this would be a very appropriate direction for some
> > other projects as well.
> >
> Well, I guess the best thing you can do to move this patch forward is to
> actually try that on your real-world use case, and report your results
> and possibly do a review of the patch.
>

Yeah, I expect to do this within the next month or two.


>
> IIRC there was an extension [1] leveraging this custom compression
> interface for better jsonb compression, so perhaps that would work for
> you (not sure if it's up to date with the current patch, though).
>
> [1]
>
> https://www.postgresql.org/message-id/20171130182009.1b492eb2%40wp.localdomain
>
Yeah I will be looking at a couple different approaches here and reporting
back. I don't expect it will be a full production workload but I do expect
to be able to report on benchmarks in both storage and performance.


>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


-- 
Best Regards,
Chris Travers
Head of Database

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


Re: Concurrency bug with vacuum full (cluster) and toast

2019-03-19 Thread Robert Haas
On Mon, Mar 18, 2019 at 12:53 PM Alexander Korotkov
 wrote:
> I've discovered bug, when vacuum full fails with error, because it
> couldn't find toast chunks deleted by itself.  That happens because
> cluster_rel() sets OldestXmin, but toast accesses gets snapshot later
> and independently.  That causes heap_page_prune_opt() to clean chunks,
> which rebuild_relation() expects to exist.  This bug very rarely
> happens on busy systems which actively update toast values.  But I
> found way to reliably reproduce it using debugger.

Boy, I really feel like we've talked about this before.  These are
somewhat-related discussions, but none of them are exactly the same
thing:

http://postgr.es/m/1335.1304187...@sss.pgh.pa.us
http://postgr.es/m/20362.1359747...@sss.pgh.pa.us
http://postgr.es/m/87in8nec96@news-spur.riddles.org.uk

I don't know whether we've actually talked about this precise problem
before and I just can't find the thread, or whether I'm confusing what
you've found here with some closely-related issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pg_upgrade version checking questions

2019-03-19 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-03-19 00:35, Tom Lane wrote:
>> 2. check_cluster_versions() insists that the target version be the
>> same major version as pg_upgrade itself, but is that really good enough?
>> As things stand, it looks like pg_upgrade 11.3 would happily use pg_dump
>> 11.1, or vice versa.

> I'd hesitate to tie this down too much.  It's possible that either the
> client or the server package cannot currently be upgraded because of
> some other dependencies.  In fact, a careful packager might as a result
> of a change like this tie the client and server packages together with
> an exact version match.  This has the potential to make the global
> dependency hell worse.

I'm not really getting your point here.  Packagers ordinarily tie
those versions together anyway, I'd expect --- there's no upside
to not doing so, and plenty of risk if one doesn't, because of
exactly the sort of coordinated-changes hazard I'm talking about here.

>> 3. Actually, I'm kind of wondering why pg_upgrade has a --new-bindir
>> option at all, rather than just insisting on finding the new-version
>> executables in the same directory it is in.  This seems like, at best,
>> a hangover from before it got into core.  Even if you don't want to
>> remove the option, we could surely provide a useful default setting
>> based on find_my_exec.

> Previously discussed here:
> https://www.postgresql.org/message-id/flat/1304710184.28821.9.camel%40vanquo.pezone.net
>  (Summary: right)

Mmm.  The point that a default is of no particular use to scripts is
still valid.  Shall we then remove the useless call to find_my_exec?

regards, tom lane



Re: Online verification of checksums

2019-03-19 Thread Michael Banck
Hi,

Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> It's torn pages that I am concerned about - the server is writing and
> we are reading, and we get a mix of old and new content.  We have been
> quite diligent about protecting ourselves from such risks elsewhere,
> and checksum verification should not be held to any lesser standard.

If we see a checksum failure on an otherwise correctly read block in
online mode, we retry the block on the theory that we might have read a
torn page.  If the checksum verification still fails, we compare the
block's LSN to the LSN of the current checkpoint and don't mind if it is
newer.  This way, a torn page should not cause a false positive either
way, I think.  If it is a genuine storage failure we will see it in the
next pg_checksums run, as its LSN will be older than the checkpoint.
The basebackup checksum verification works in the same way.

I am happy to look into further options for how to make things better,
but I am not sure what the actual problem might be that you mention
above. I will see whether I can stress-test the patch a bit more, but
I've already taxed the SSD on my company notebook quite a bit during the
development of this, so I will see whether I can get some real server
hardware somewhere.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: Offline enabling/disabling of data checksums

2019-03-19 Thread Michael Banck
Hi,

Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund:
> On 2019-03-18 17:13:01 +0900, Michael Paquier wrote:
> > +/*
> > + * Locations of persistent and temporary control files.  The control
> > + * file gets renamed into a temporary location when enabling checksums
> > + * to prevent a parallel startup of Postgres.
> > + */
> > +#define CONTROL_FILE_PATH  "global/pg_control"
> > +#define CONTROL_FILE_PATH_TEMP CONTROL_FILE_PATH 
> > ".pg_checksums_in_progress"
> 
> I think this should be outright rejected. Again, you're making the
> control file into something it isn't. And there's no buyin for this as
> far as I can tell outside of Fabien and you. For crying out loud, if the
> server crashes during this YOU'VE CORRUPTED THE CLUSTER.

The cluster is supposed to be offline during this.  This is just an
additional precaution so that nobody starts it during the operation -
similar to how pg_upgrade disables the old data directory.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: Online verification of checksums

2019-03-19 Thread Andres Freund
Hi,

On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> > It's torn pages that I am concerned about - the server is writing and
> > we are reading, and we get a mix of old and new content.  We have been
> > quite diligent about protecting ourselves from such risks elsewhere,
> > and checksum verification should not be held to any lesser standard.
> 
> If we see a checksum failure on an otherwise correctly read block in
> online mode, we retry the block on the theory that we might have read a
> torn page.  If the checksum verification still fails, we compare its LSN
> to the LSN of the current checkpoint and don't mind if its newer.  This
> way, a torn page should not cause a false positive either way I
> think?.

False positives, no. But there's plenty of potential for false
negatives. In plenty of clusters a large fraction of the pages is going
to be touched in most checkpoints.


>  If it is a genuine storage failure we will see it in the next
> pg_checksums run as its LSN will be older than the checkpoint.

Well, but also, by that time it might be too late to recover things. Or
it might be a backup that you just made, that you later want to recover
from, ...


> The basebackup checksum verification works in the same way.

Shouldn't have been merged that way.

Greetings,

Andres Freund



Re: Offline enabling/disabling of data checksums

2019-03-19 Thread Andres Freund
On 2019-03-19 16:55:12 +0100, Michael Banck wrote:
> Hi,
> 
> Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund:
> > On 2019-03-18 17:13:01 +0900, Michael Paquier wrote:
> > > +/*
> > > + * Locations of persistent and temporary control files.  The control
> > > + * file gets renamed into a temporary location when enabling checksums
> > > + * to prevent a parallel startup of Postgres.
> > > + */
> > > +#define CONTROL_FILE_PATH"global/pg_control"
> > > +#define CONTROL_FILE_PATH_TEMP   CONTROL_FILE_PATH 
> > > ".pg_checksums_in_progress"
> > 
> > I think this should be outright rejected. Again, you're making the
> > control file into something it isn't. And there's no buyin for this as
> > far as I can tell outside of Fabien and you. For crying out loud, if the
> > server crashes during this YOU'VE CORRUPTED THE CLUSTER.
> 
> The cluster is supposed to be offline during this.  This is just an
> additional precaution so that nobody starts it during the operation -
> similar to how pg_upgrade disables the old data directory.

I don't see how that matters. Afterwards the cluster needs low level
surgery to be recovered. That's a) undocumented b) likely to be done
wrongly.  This is completely unacceptable *AND UNNECESSARY*.

Greetings,

Andres Freund



Re: Offline enabling/disabling of data checksums

2019-03-19 Thread Michael Banck
Hi,

Am Dienstag, den 19.03.2019, 09:00 -0700 schrieb Andres Freund:
> On 2019-03-19 16:55:12 +0100, Michael Banck wrote:
> > Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund:
> > > On 2019-03-18 17:13:01 +0900, Michael Paquier wrote:
> > > > +/*
> > > > + * Locations of persistent and temporary control files.  The control
> > > > + * file gets renamed into a temporary location when enabling checksums
> > > > + * to prevent a parallel startup of Postgres.
> > > > + */
> > > > +#define CONTROL_FILE_PATH  "global/pg_control"
> > > > +#define CONTROL_FILE_PATH_TEMP CONTROL_FILE_PATH 
> > > > ".pg_checksums_in_progress"
> > > 
> > > I think this should be outright rejected. Again, you're making the
> > > control file into something it isn't. And there's no buyin for this as
> > > far as I can tell outside of Fabien and you. For crying out loud, if the
> > > server crashes during this YOU'VE CORRUPTED THE CLUSTER.
> > 
> > The cluster is supposed to be offline during this.  This is just an
> > additional precaution so that nobody starts it during the operation -
> > similar to how pg_upgrade disables the old data directory.
> 
> I don't see how that matters. Afterwards the cluster needs low level
> surgery to be recovered. That's a) undocumented b) likely to be done
> wrongly.  This is completely unacceptable *AND UNNECESSARY*.

Can you explain why low level surgery is needed and how that would look
like?

If pg_checksums successfully enables checksums, it will move back the
control file and update the checksum version - the cluster is ready to
be started again unless I am missing something?

If pg_checksums is interrupted by the admin, it will move back the
control file and the cluster is ready to be started again as well.

If pg_checksums aborts with a failure, the admin will have to move back
the control file before starting up the instance again, but I don't
think that counts?

If pg_checksums crashes due to I/O failures or other causes I can see
how possibly the block it was currently writing might need low level
surgery, but in that case we are in the domain of forensics already I
guess and that still does not pertain to the control file?

Sorry for being obtuse, I don't get it.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: Offline enabling/disabling of data checksums

2019-03-19 Thread Andres Freund
Hi,

On 2019-03-19 17:08:17 +0100, Michael Banck wrote:
> Am Dienstag, den 19.03.2019, 09:00 -0700 schrieb Andres Freund:
> > On 2019-03-19 16:55:12 +0100, Michael Banck wrote:
> > > Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund:
> > > > On 2019-03-18 17:13:01 +0900, Michael Paquier wrote:
> > > > > +/*
> > > > > + * Locations of persistent and temporary control files.  The control
> > > > > + * file gets renamed into a temporary location when enabling 
> > > > > checksums
> > > > > + * to prevent a parallel startup of Postgres.
> > > > > + */
> > > > > +#define CONTROL_FILE_PATH"global/pg_control"
> > > > > +#define CONTROL_FILE_PATH_TEMP   CONTROL_FILE_PATH 
> > > > > ".pg_checksums_in_progress"
> > > > 
> > > > I think this should be outright rejected. Again, you're making the
> > > > control file into something it isn't. And there's no buyin for this as
> > > > far as I can tell outside of Fabien and you. For crying out loud, if the
> > > > server crashes during this YOU'VE CORRUPTED THE CLUSTER.
> > > 
> > > The cluster is supposed to be offline during this.  This is just an
> > > additional precaution so that nobody starts it during the operation -
> > > similar to how pg_upgrade disables the old data directory.
> > 
> > I don't see how that matters. Afterwards the cluster needs low level
> > surgery to be recovered. That's a) undocumented b) likely to be done
> > wrongly.  This is completely unacceptable *AND UNNECESSARY*.
> 
> Can you explain why low level surgery is needed and how that would look
> like?
> 
> If pg_checksums successfully enables checksums, it will move back the
> control file and update the checksum version - the cluster is ready to
> be started again unless I am missing something?
> 
> If pg_checksums is interrupted by the admin, it will move back the
> control file and the cluster is ready to be started again as well.
> 
> If pg_checksums aborts with a failure, the admin will have to move back
> the control file before starting up the instance again, but I don't
> think that counts?

That absolutely counts. Even a short period would imo be unacceptable,
but this will take *hours* in many clusters. It's completely possible
that the machine crashes while the enabling is in progress.

And after restarting postgres or even pg_checksums, you'll just get a
message that there's no control file. How on earth is a normal user
supposed to recover from that?  Now, you could have a check for the
control file under the temporary name, and emit a hint about renaming,
but that has its own dangers (like people renaming it just to start
postgres).

And you're basically adding it because Fabien doesn't like
postmaster.pid and wants to invent another lockout mechanism in this
thread.

Greetings,

Andres Freund



Re: Offline enabling/disabling of data checksums

2019-03-19 Thread Michael Banck
Hi,

Am Dienstag, den 19.03.2019, 09:13 -0700 schrieb Andres Freund:
> On 2019-03-19 17:08:17 +0100, Michael Banck wrote:
> > Am Dienstag, den 19.03.2019, 09:00 -0700 schrieb Andres Freund:
> > > On 2019-03-19 16:55:12 +0100, Michael Banck wrote:
> > > > Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund:
> > > > > On 2019-03-18 17:13:01 +0900, Michael Paquier wrote:
> > > > > > +/*
> > > > > > + * Locations of persistent and temporary control files.  The 
> > > > > > control
> > > > > > + * file gets renamed into a temporary location when enabling 
> > > > > > checksums
> > > > > > + * to prevent a parallel startup of Postgres.
> > > > > > + */
> > > > > > +#define CONTROL_FILE_PATH  "global/pg_control"
> > > > > > +#define CONTROL_FILE_PATH_TEMP CONTROL_FILE_PATH 
> > > > > > ".pg_checksums_in_progress"
> > > > > 
> > > > > I think this should be outright rejected. Again, you're making the
> > > > > control file into something it isn't. And there's no buyin for this as
> > > > > far as I can tell outside of Fabien and you. For crying out loud, if 
> > > > > the
> > > > > server crashes during this YOU'VE CORRUPTED THE CLUSTER.
> > > > 
> > > > The cluster is supposed to be offline during this.  This is just an
> > > > additional precaution so that nobody starts it during the operation -
> > > > similar to how pg_upgrade disables the old data directory.
> > > 
> > > I don't see how that matters. Afterwards the cluster needs low level
> > > surgery to be recovered. That's a) undocumented b) likely to be done
> > > wrongly.  This is completely unacceptable *AND UNNECESSARY*.
> > 
> > Can you explain why low level surgery is needed and how that would look
> > like?
> > 
> > If pg_checksums successfully enables checksums, it will move back the
> > control file and update the checksum version - the cluster is ready to
> > be started again unless I am missing something?
> > 
> > If pg_checksums is interrupted by the admin, it will move back the
> > control file and the cluster is ready to be started again as well.
> > 
> > If pg_checksums aborts with a failure, the admin will have to move back
> > the control file before starting up the instance again, but I don't
> > think that counts?
> 
> That absolutely counts. Even a short period would imo be unacceptable,
> but this will take *hours* in many clusters. It's completely possible
> that the machine crashes while the enabling is in progress.
> 
> And after restarting postgres or even pg_checksums, you'll just get a
> message that there's no control file. How on earth is a normal user
> supposed to recover from that?  Now, you could have a check for the
> control file under the temporary name, and emit a hint about renaming,
> but that has its own dangers (like people renaming it just to start
> postgres).

Ok, thanks for explaining. 

I guess if we make the postmaster check for the temporary name during startup
when pg_control isn't there, then a more generally useful name like
"pg_control.maintenance" should be picked. We could then spit out a nice
error message or hint like "the cluster has been disabled for
maintenance. In order to start it up anyway, rename
pg_control.maintenance to pg_control" or so.

In any case, this would be more of an operational or availability issue
and not a data-loss issue, as I had feared from your previous mails.

> And you're basically adding it because Fabien doesn't like
> postmaster.pid and wants to invent another lockout mechanism in this
> thread.

I think the hazard of another DBA (or some automated configuration
management or HA tool for that matter) looking at postmaster.pid,
deciding that it is not a legit file from a running instance, deleting
it and then starting up Postgres while pg_checksums is still at work is
worse than the above scenario, but maybe if we make the content of
postmaster.pid clear enough (like "maintenance in progress"?) it would
be enough of a hint? Or do you have concrete suggestions on how this
should work?

I had the feeling (ab)using postmaster.pid for this would fly less than
using the same scheme as pg_upgrade does, but I'm fine doing it either
way.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz



Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread David Fetter
On Tue, Mar 19, 2019 at 10:30:45AM -0400, Tom Lane wrote:
> Kyotaro HORIGUCHI  writes:
> > At Tue, 19 Mar 2019 08:18:23 +, "Wu, Fei"  
> > wrote in 
> > <52E6E0843B9D774C8C73D6CF64402F05621F0FFC@G08CNEXMBPEKD02.g08.fujitsu.local>
> >> I will try to fix it~
> 
> > I don't oppose that, but as the discussion linked from there [1],
> > psql already has a feature that sends multiple statements by one
> > PQexec() in two ways. Fixing it means making the features
> > obsolete.
> 
> Yeah, the problem here is that a lot of people think that that's
> a feature not a bug.  You certainly can't get away with just summarily
> changing the behavior of PQexec without any recourse.  Maybe there
> would be acceptance for either of
> 
> (1) a different function that is like PQexec but restricts the
> query string
> 
> (2) a connection option or state variable that affects PQexec's
> behavior --- but it probably still has to default to permissive.
> 
> Unfortunately, if the default behavior doesn't change, then there's little
> argument for doing this at all.  The security reasoning behind doing
> anything in this area would be to provide an extra measure of protection
> against SQL-injection attacks on carelessly-written clients, and of course
> it's unlikely that a carelessly-written client would get changed to make
> use of a non-default feature.

It's also unlikely that writers and maintainers of carelessly-written
clients are our main user base. Quite the opposite, in fact. Do we
really need to set their failure to make an effort as a higher
priority than getting this fixed?

I think the answer is "no," and we should deprecate this misfeature.
It's bad enough that we'll be supporting it for five years after
deprecating it, but it's worse to leave it hanging around our necks
forever. https://en.wikipedia.org/wiki/Albatross_(metaphor)

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Libpq support to connect to standby server as priority

2019-03-19 Thread Robert Haas
On Mon, Mar 18, 2019 at 9:33 PM Haribabu Kommi  wrote:
> While working on the implementation of the new target_server_type connection
> option for libpq to specify master, slave, etc., there is no problem when the
> newly added target_server_type option is used separately, but when it is
> combined with the existing target_session_attrs, there may be some
> combinations that are not valid or for which no such server exists.
>
> Target_session_attrs   Target_server_type
> 
> read-write              prefer-slave, slave
> prefer-read             master, slave
> read-only                master, prefer-slave
>
> I know that some of the cases above are possible, like a master server that
> by default accepts read-only sessions. Instead of putting in a check to
> validate which combinations are valid, how about allowing all combinations,
> and if a combination is not satisfiable (meaning there is no server that
> supports the requirement), the connection simply fails.
>
> comments?

I really dislike having both target_session_attrs and
target_server_type.  It doesn't solve any actual problem.  master,
slave, prefer-slave, or whatever you like could be put in
target_session_attrs just as easily, and then we wouldn't end up with
two keywords doing closely related things.  'master' is no more or
less a server attribute than 'read-write'.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Offline enabling/disabling of data checksums

2019-03-19 Thread Andres Freund
Hi,

On 2019-03-19 17:30:16 +0100, Michael Banck wrote:
> Am Dienstag, den 19.03.2019, 09:13 -0700 schrieb Andres Freund:
> > On 2019-03-19 17:08:17 +0100, Michael Banck wrote:
> > > Am Dienstag, den 19.03.2019, 09:00 -0700 schrieb Andres Freund:
> > > > On 2019-03-19 16:55:12 +0100, Michael Banck wrote:
> > > > > Am Dienstag, den 19.03.2019, 08:36 -0700 schrieb Andres Freund:
> > > > > > On 2019-03-18 17:13:01 +0900, Michael Paquier wrote:
> > > > > > > +/*
> > > > > > > + * Locations of persistent and temporary control files.  The 
> > > > > > > control
> > > > > > > + * file gets renamed into a temporary location when enabling 
> > > > > > > checksums
> > > > > > > + * to prevent a parallel startup of Postgres.
> > > > > > > + */
> > > > > > > +#define CONTROL_FILE_PATH"global/pg_control"
> > > > > > > +#define CONTROL_FILE_PATH_TEMP   CONTROL_FILE_PATH 
> > > > > > > ".pg_checksums_in_progress"
> > > > > > 
> > > > > > I think this should be outright rejected. Again, you're making the
> > > > > > control file into something it isn't. And there's no buyin for this 
> > > > > > as
> > > > > > far as I can tell outside of Fabien and you. For crying out loud, 
> > > > > > if the
> > > > > > server crashes during this YOU'VE CORRUPTED THE CLUSTER.
> > > > > 
> > > > > The cluster is supposed to be offline during this.  This is just an
> > > > > additional precaution so that nobody starts it during the operation -
> > > > > similar to how pg_upgrade disables the old data directory.
> > > > 
> > > > I don't see how that matters. Afterwards the cluster needs low level
> > > > surgery to be recovered. That's a) undocumented b) likely to be done
> > > > wrongly.  This is completely unacceptable *AND UNNECESSARY*.
> > > 
> > > Can you explain why low level surgery is needed and how that would look
> > > like?
> > > 
> > > If pg_checksums successfully enables checksums, it will move back the
> > > control file and update the checksum version - the cluster is ready to
> > > be started again unless I am missing something?
> > > 
> > > If pg_checksums is interrupted by the admin, it will move back the
> > > control file and the cluster is ready to be started again as well.
> > > 
> > > If pg_checksums aborts with a failure, the admin will have to move back
> > > the control file before starting up the instance again, but I don't
> > > think that counts?
> > 
> > That absolutely counts. Even a short period would imo be unacceptable,
> > but this will take *hours* in many clusters. It's completely possible
> > that the machine crashes while the enabling is in progress.
> > 
> > And after restarting postgres or even pg_checksums, you'll just get a
> > message that there's no control file. How on earth is a normal user
> > supposed to recover from that?  Now, you could have a check for the
> > control file under the temporary name, and emit a hint about renaming,
> > but that has its own dangers (like people renaming it just to start
> > postgres).
> 
> Ok, thanks for explaining. 
> 
> I guess if we check for the temporary name in postmaster during startup
> if pg_control isn't there then a more generally useful name like
> "pg_control.maintenance" should be picked. We could then spit out a nice
> error message or hint like "the cluster has been disabled for
> maintenance. In order to start it up anyway, rename
> pg_control.maintenance to pg_control" or so.

To be very clear: I am going to try to stop any patch with this
mechanism from going into the tree. I think it's an absurdly bad
idea. There'd need to be significant support from a number of other
committers to convince me otherwise.


> In any case, this would be more of a operational or availability issue
> and not a data-loss issue, as I feared from your previous mails.

It's just about indistinguishable for normal users.


> > And you're basically adding it because Fabien doesn't like
> > postmaster.pid and wants to invent another lockout mechanism in this
> > thread.
> 
> I think the hazard of another DBA (or some automated configuration
> management or HA tool for that matter) looking at postmaster.pid,
> deciding that it is not a legit file from a running instance, deleting
> it and then starting up Postgres while pg_checksums is still at work is
> worse than the above scenario, but maybe if we make the content of
> postmaster.pid clear enough (like "maintenance in progress"?) it would
> be enough of a hint?

Err, how would such a tool decide to do that safely? And if it did so,
how would it not cause problems in postgres' normal operation, given
that postmaster.pid is crucial to prevent two postgres instances
running at the same time?


> Or do you have concrete suggestions on how this should work?

create a postmaster.pid with the pid of pg_checksums. That'll make a
postgres start check whether that pid is still alive. There'd
probably need to be a tiny change to CreateLockFile() to prevent it from
checking whether any shared m

Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread Tom Lane
David Fetter  writes:
> I think the answer is "no," and we should deprecate this misfeature.
> It's bad enough that we'll be supporting it for five years after
> deprecating it, but it's worse to leave it hanging around our necks
> forever. https://en.wikipedia.org/wiki/Albatross_(metaphor)

The problem with that approach is that not everybody agrees that
it's a misfeature.

regards, tom lane



Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread Andres Freund
Hi,

On 2019-03-19 12:51:39 -0400, Tom Lane wrote:
> David Fetter  writes:
> > I think the answer is "no," and we should deprecate this misfeature.
> > It's bad enough that we'll be supporting it for five years after
> > deprecating it, but it's worse to leave it hanging around our necks
> > forever. https://en.wikipedia.org/wiki/Albatross_(metaphor)
> 
> The problem with that approach is that not everybody agrees that
> it's a misfeature.

Yea, it's extremely useful to just be able to send a whole script to the
server. Otherwise every application wanting to do so needs to be able to
split SQL statements, not exactly a trivial task. And the result will be
slower, due to increased roundtrips.

Greetings,

Andres Freund



Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread Alvaro Herrera
On 2019-Mar-19, Andres Freund wrote:

> Hi,
> 
> On 2019-03-19 12:51:39 -0400, Tom Lane wrote:
> > David Fetter  writes:
> > > I think the answer is "no," and we should deprecate this misfeature.
> > > It's bad enough that we'll be supporting it for five years after
> > > deprecating it, but it's worse to leave it hanging around our necks
> > > forever. https://en.wikipedia.org/wiki/Albatross_(metaphor)
> > 
> > The problem with that approach is that not everybody agrees that
> > it's a misfeature.
> 
> Yea, it's extremely useful to just be able to send a whole script to the
> server. Otherwise every application wanting to do so needs to be able to
> split SQL statements, not exactly a trivial task. And the result will be
> slower, due to increased roundtrips.

I suppose it can be argued that for the cases where they want that, it
is not entirely ridiculous to have it be done with a different API call,
say PQexecMultiple.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread David Fetter
On Tue, Mar 19, 2019 at 01:59:34PM -0300, Alvaro Herrera wrote:
> On 2019-Mar-19, Andres Freund wrote:
> 
> > Hi,
> > 
> > On 2019-03-19 12:51:39 -0400, Tom Lane wrote:
> > > David Fetter  writes:
> > > > I think the answer is "no," and we should deprecate this misfeature.
> > > > It's bad enough that we'll be supporting it for five years after
> > > > deprecating it, but it's worse to leave it hanging around our necks
> > > > forever. https://en.wikipedia.org/wiki/Albatross_(metaphor)
> > > 
> > > The problem with that approach is that not everybody agrees that
> > > it's a misfeature.
> > 
> > Yea, it's extremely useful to just be able to send a whole script to the
> > server. Otherwise every application wanting to do so needs to be able to
> > split SQL statements, not exactly a trivial task. And the result will be
> > slower, due to increased roundtrips.
> 
> I suppose it can be argued that for the cases where they want that, it
> is not entirely ridiculous to have it be done with a different API call,
> say PQexecMultiple.

Renaming it to emphasize that it's a non-default choice seems like a
large step in the right direction.

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread Andres Freund
Hi,

On 2019-03-19 13:59:34 -0300, Alvaro Herrera wrote:
> On 2019-Mar-19, Andres Freund wrote:
> > On 2019-03-19 12:51:39 -0400, Tom Lane wrote:
> > > David Fetter  writes:
> > > > I think the answer is "no," and we should deprecate this misfeature.
> > > > It's bad enough that we'll be supporting it for five years after
> > > > deprecating it, but it's worse to leave it hanging around our necks
> > > > forever. https://en.wikipedia.org/wiki/Albatross_(metaphor)
> > > 
> > > The problem with that approach is that not everybody agrees that
> > > it's a misfeature.
> > 
> > Yea, it's extremely useful to just be able to send a whole script to the
> > server. Otherwise every application wanting to do so needs to be able to
> > split SQL statements, not exactly a trivial task. And the result will be
> > slower, due to increased roundtrips.
> 
> I suppose it can be argued that for the cases where they want that, it
> is not entirely ridiculous to have it be done with a different API call,
> say PQexecMultiple.

Sure, but what'd the gain be? Using PQexecParams() already enforces that
there's only a single command. Sure, explicit is better than implicit
and all that, but is that justification for breaking a significant
number of applications?

Greetings,

Andres Freund



Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread Andres Freund
On 2019-03-19 10:02:33 -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-03-19 13:59:34 -0300, Alvaro Herrera wrote:
> > On 2019-Mar-19, Andres Freund wrote:
> > > On 2019-03-19 12:51:39 -0400, Tom Lane wrote:
> > > > David Fetter  writes:
> > > > > I think the answer is "no," and we should deprecate this misfeature.
> > > > > It's bad enough that we'll be supporting it for five years after
> > > > > deprecating it, but it's worse to leave it hanging around our necks
> > > > > forever. https://en.wikipedia.org/wiki/Albatross_(metaphor)
> > > > 
> > > > The problem with that approach is that not everybody agrees that
> > > > it's a misfeature.
> > > 
> > > Yea, it's extremely useful to just be able to send a whole script to the
> > > server. Otherwise every application wanting to do so needs to be able to
> > > split SQL statements, not exactly a trivial task. And the result will be
> > > slower, due to increased roundtrips.
> > 
> > I suppose it can be argued that for the cases where they want that, it
> > is not entirely ridiculous to have it be done with a different API call,
> > say PQexecMultiple.
> 
> Sure, but what'd the gain be? Using PQexecParams() already enforces that
> there's only a single command. Sure, explicit is better than implicit
> and all that, but is that justification for breaking a significant
> number of applications?

In short: I think we should just remove this todo entry.  If somebody
feels like we should do something, I guess making the dangers of
PQexec() vs PQexecPrepared() even clearer would be the best thing to
do. Although I actually find it easy enough, it's not like we're holding
back:

https://www.postgresql.org/docs/devel/libpq-exec.html

PQexec():

The command string can include multiple SQL commands (separated by semicolons). 
Multiple queries sent in a single PQexec call are processed in a single 
transaction, unless there are explicit BEGIN/COMMIT commands included in the 
query string to divide it into multiple transactions. (See Section 52.2.2.1 for 
more details about how the server handles multi-query strings.) Note however 
that the returned PGresult structure describes only the result of the last 
command executed from the string. Should one of the commands fail, processing 
of the string stops with it and the returned PGresult describes the error 
condition.

PQexecParams():
Unlike PQexec, PQexecParams allows at most one SQL command in the given string. 
(There can be semicolons in it, but not more than one nonempty command.) This 
is a limitation of the underlying protocol, but has some usefulness as an extra 
defense against SQL-injection attacks.



Re: jsonpath

2019-03-19 Thread Alexander Korotkov
On Sun, Mar 17, 2019 at 6:03 PM Tom Lane  wrote:
> Andrew Dunstan  writes:
> > Why are we installing the jsonpath_gram.h file? It's not going to be
> > used by anything else, is it? TBH, I'm not sure I see why we're
> > installing the scanner.h file either.
>
> As near as I can see, jsonpath_gram.h and jsonpath_scanner.h exist only
> for communication between jsonpath_gram.y and jsonpath_scan.l.  Maybe
> we'd be better off handling that need by compiling the .l file as part
> of the .y file, as we used to do with the core lexer and still do with
> several others (cf comments for commit 72b1e3a21); then we wouldn't
> even have to generate these files much less install them.
>
> A quick look at jsonpath_scan.c shows that it's pretty innocent of
> the tricks we've learned over the years for flex/bison portability;
> in particular I see that it's #including  before postgres.h,
> which is a no-no.  So that whole area needs more review anyway.

The attached patch gets rid of jsonpath_gram.h.  I would like to see the
results of http://commitfest.cputube.org/ before committing, because
I'm not able to test this on a Windows machine.  There will be
further patches rearranging jsonpath_gram.y and jsonpath_scan.l.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


0001-get-rid-of-jsonpath_gram_h.patch
Description: Binary data


Automated way to find actual COMMIT LSN of subxact LSN

2019-03-19 Thread Jeremy Finzel
I want to build automation to recover a database to a specific LSN
*inclusive*, even if that LSN is from a subtransaction.  The problem I am
facing is that I know what specific LSN wrote a row on a remote system, but
if I create a recovery.conf file with:

recovery_target_lsn = '95F/BBA36DF8'

and 95F/BBA36DF8 is actually a subtransaction, then even if I use default
behavior of recovery_target_inclusive = true, that transaction will NOT be
included in the restore point, because it is prior to the actual COMMIT LSN
of which this lsn/subxact is a part.

My hack for now is to simply manually scan down until I find the COMMIT,
which is the only way so far I can figure to find it out.  I don't want to
hack some kind of search script based on this if there is already a better
way to get this information... anyone know of a way?

Thank you,
Jeremy


Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread Tom Lane
Andres Freund  writes:
> On 2019-03-19 13:59:34 -0300, Alvaro Herrera wrote:
>> I suppose it can be argued that for the cases where they want that, it
>> is not entirely ridiculous to have it be done with a different API call,
>> say PQexecMultiple.

> Sure, but what'd the gain be? Using PQexecParams() already enforces that
> there's only a single command. Sure, explicit is better than implicit
> and all that, but is that justification for breaking a significant
> number of applications?

Right, the tradeoff here comes down to breaking existing apps vs.
adding security for poorly-written apps.  Whether you think it's
worthwhile to break stuff depends on your estimate of how common
poorly-written apps are.  To that point, I'd be inclined to throw
David's previous comment back at him: they're likely not that
common.  A well-written app should probably be treating insecure
inputs as parameters in PQexecParams anyhow, making this whole
discussion moot.

Having said that ... a better argument for a new API is that it
could be explicitly designed to handle multiple queries, and in
particular make some provision for returning multiple PGresults.
Maybe if we had that there would be more support for deprecating
the ability to send multiple queries in plain PQexec.  It'd still
be a long time before we could turn it off though, at least by
default.

regards, tom lane



Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread Andres Freund
Hi,

On 2019-03-19 13:18:25 -0400, Tom Lane wrote:
> Having said that ... a better argument for a new API is that it
> could be explicitly designed to handle multiple queries, and in
> particular make some provision for returning multiple PGresults.

Oh, I completely agree, that'd be hugely useful.

Greetings,

Andres Freund



Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread David Fetter
On Tue, Mar 19, 2019 at 01:18:25PM -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2019-03-19 13:59:34 -0300, Alvaro Herrera wrote:
> >> I suppose it can be argued that for the cases where they want that, it
> >> is not entirely ridiculous to have it be done with a different API call,
> >> say PQexecMultiple.
> 
> > Sure, but what'd the gain be? Using PQexecParams() already enforces that
> > there's only a single command. Sure, explicit is better than implicit
> > and all that, but is that justification for breaking a significant
> > number of applications?
> 
> Right, the tradeoff here comes down to breaking existing apps vs.
> adding security for poorly-written apps.  Whether you think it's
> worthwhile to break stuff depends on your estimate of how common
> poorly-written apps are.  To that point, I'd be inclined to throw
> David's previous comment back at him: they're likely not that
> common.  A well-written app should probably be treating insecure
> inputs as parameters in PQexecParams anyhow, making this whole
> discussion moot.
> 
> Having said that ... a better argument for a new API is that it
> could be explicitly designed to handle multiple queries, and in
> particular make some provision for returning multiple PGresults.

That sounds like it'd be *really* handy if one were building a
client-side retry framework. People will be doing (the equivalent of)
this as the vulnerabilities inherent in isolation levels lower than
SERIALIZABLE become better known.
https://www.cockroachlabs.com/blog/acid-rain/

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: extensions are hitting the ceiling

2019-03-19 Thread Eric Hanson
On Mon, Mar 18, 2019 at 11:56 PM Chapman Flack 
wrote:

> On 03/18/19 22:38, Eric Hanson wrote:
> > rows are really second class citizens:  They aren't tracked with
> > pg_catalog.pg_depend, they aren't deleted when the extension is dropped,
> > etc.
>
> This. You have other interests as well, but this is the one I was thinking
> about a few years ago in [1] (starting at "Ok, how numerous would be the
> problems with this:").
>

Cool!

First thoughts, it seems like a sensible way to go given the premise that
extensions are immutable.  But -- I'd be a little concerned about the
performance ramifications.  Usually there are not jillions of database
objects in an extension, but if they started containing data, there sure
could be jillions of rows.  Every row would have to be checked for
existence as part of an extension on every insert or update, no?

Nobody ever chimed in to say how numerous they did or didn't think the
> problems would be. I was actually thinking recently about sitting down
> and trying to write that patch, as no one had exactly stood up to say
> "oh heavens no, don't write that." But my round tuits are all deployed
> elsewhere at the moment.
>

Likewise, if nobody tells me "oh sheeze extensions can already do all this"
I'm going to assume they can't. :-)

I'd still like to discuss the ideas.


Me too!

Ok, I should probably come out and say it:  I think the user story of
"There is some kind of packaging system that can contain both schema and
data, and these packages can be installed and removed along with their
dependencies atomically" is fairly obvious and desirable.  But getting
there while accepting the premises that are currently baked into extensions
might be a tall order.

Extensions have a middleware-ish aspect to them -- they are immutable and
that immutability is checked and enforced at runtime.  That might scale
just fine to a few dozen database objects that only check pg_depend on DDL
operations, but if we introduce record tracking and start sticking sticks
into the wheels of the DML, things could go south really quickly it seems.

I really like a more git-like pattern, where you are free to modify the
working copy of a repository (or in this case an extension), and instead of
being blocked from changing things, the system tells the user what has
changed and how, and gives sensible options for what to do about it.  That
way it doesn't incur a performance hit, and the user can do a kind of "git
status" on their extension to show any changes.

How about an extension system whose first principle is that an extension is
made up of rows, period.  What about the DDL you ask?  Well...

Imagine a system catalog whose sole purpose is to contain database object
definitions like "CREATE VIEW ...", similar to those produced by
pg_catalog.pg_get_viewdef(), pg_catalog.pg_get_functiondef(), etc. Let's call
this catalog `def`. There is exactly one VIEW for every type of database
object in PostgreSQL. def.table, def.role, def.sequence, def.operator,
def.type, etc. Each def.* VIEW contains only two columns, `id` and
`definition`.  The `id` column contains a unique identifier for the object,
and the `definition` column contains the SQL statement that will recreate
the object.

So, inside this system catalog is the SQL definition statement of every
database object.  In theory, the contents of all the `definition` columns
together would be similar to the contents of pg_dump --schema-only.

Now, imagine all these def.* views had insert triggers, so that on insert,
it actually executes the contents of the `definition` column.  In theory,
we could pg_restore the data in the def.* views, and it would recreate all
the database objects. It could shift all that logic out of pg_dump and into
the database.
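
As a minimal sketch of that idea (names such as def."view" and def.view_insert
are purely illustrative, this only covers views, and a complete version would
need the pg_get_*def functions mentioned below):

CREATE SCHEMA def;

-- One row per view: its oid and a statement that recreates it.
CREATE VIEW def."view" AS
    SELECT c.oid AS id,
           'CREATE VIEW ' || quote_ident(n.nspname) || '.' || quote_ident(c.relname)
               || ' AS ' || pg_catalog.pg_get_viewdef(c.oid) AS definition
      FROM pg_catalog.pg_class c
      JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
     WHERE c.relkind = 'v';

-- Inserting a row executes the stored definition, recreating the object.
CREATE FUNCTION def.view_insert() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    EXECUTE NEW.definition;
    RETURN NEW;
END;
$$;

CREATE TRIGGER view_insert
    INSTEAD OF INSERT ON def."view"
    FOR EACH ROW EXECUTE PROCEDURE def.view_insert();

Restoring previously dumped rows into such views would then replay the stored
definitions, which is the pg_restore behavior described above.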

So using the def.* catalog, we could package both "regular" table data and
system objects via the contents of the def.* catalog views.  Packages are a
collection rows, period. Build up from there.

I'm working on a prototype called bundle [1]; it still has a ways to go but
it's showing some promise.  It is going to require bringing into PostgreSQL
the missing pg_get_*def functions, as folks have talked about before [2].

Thanks,
Eric

[1]
https://github.com/aquametalabs/aquameta/tree/master/src/pg-extension/bundle
[2]
https://www.postgresql.org/message-id/20130429234634.ga10...@tornado.leadboat.com


Re: Concurrency bug with vacuum full (cluster) and toast

2019-03-19 Thread Alexander Korotkov
On Tue, Mar 19, 2019 at 6:48 PM Robert Haas  wrote:
> On Mon, Mar 18, 2019 at 12:53 PM Alexander Korotkov
>  wrote:
> > I've discovered bug, when vacuum full fails with error, because it
> > couldn't find toast chunks deleted by itself.  That happens because
> > cluster_rel() sets OldestXmin, but toast accesses gets snapshot later
> > and independently.  That causes heap_page_prune_opt() to clean chunks,
> > which rebuild_relation() expects to exist.  This bug very rarely
> > happens on busy systems which actively update toast values.  But I
> > found way to reliably reproduce it using debugger.
>
> Boy, I really feel like we've talked about this before.  These are
> somewhat-related discussions, but none of them are exactly the same
> thing:
>
> http://postgr.es/m/1335.1304187...@sss.pgh.pa.us
> http://postgr.es/m/20362.1359747...@sss.pgh.pa.us
> http://postgr.es/m/87in8nec96@news-spur.riddles.org.uk
>
> I don't know whether we've actually talked about this precise problem
> before and I just can't find the thread, or whether I'm confusing what
> you've found here with some closely-related issue.

Thank you for pointing, but none of the threads you pointed describe
this exact problem.  Now I see this bug have a set of cute siblings :)

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread Tom Lane
Andres Freund  writes:
> On 2019-03-19 13:18:25 -0400, Tom Lane wrote:
>> Having said that ... a better argument for a new API is that it
>> could be explicitly designed to handle multiple queries, and in
>> particular make some provision for returning multiple PGresults.

> Oh, I completely agree, that'd be hugely useful.

Of course, you can do that already with PQsendQuery + a loop
around PQgetResult.  So the question here is whether that can
be wrapped up into something easier-to-use.  I'm not entirely
sure what that might look like.
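
For reference, a minimal sketch of that existing pattern (connection
parameters come from the environment and the command string is just a
placeholder); any easier-to-use wrapper would essentially have to fold this
loop into a single call:

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("");
	PGresult   *res;
	int			ok = 1;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		exit(1);
	}

	if (!PQsendQuery(conn,
					 "CREATE TEMP TABLE t(i int); INSERT INTO t VALUES (1); SELECT * FROM t;"))
		fprintf(stderr, "send failed: %s", PQerrorMessage(conn));

	/* one PGresult per statement; NULL means the command string is done */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) != PGRES_COMMAND_OK &&
			PQresultStatus(res) != PGRES_TUPLES_OK)
		{
			fprintf(stderr, "statement failed: %s", PQerrorMessage(conn));
			ok = 0;
		}
		PQclear(res);
	}

	PQfinish(conn);
	return ok ? 0 : 1;
}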

We should also keep in mind that there's a perfectly valid
use-case for wanting to send a big script of commands and
just check for overall success or failure.  So it's not like
PQexec's current behavior has *no* valid uses.

regards, tom lane



Re: [HACKERS] CLUSTER command progress monitor

2019-03-19 Thread Robert Haas
On Mon, Mar 18, 2019 at 10:03 PM Tatsuro Yamada
 wrote:
> Attached patch is a rebased document patch. :)

Attached is an updated patch.  I went through this patch carefully
today, in the hopes of committing it, and I think the attached version
is pretty close to being committable, but there's at least one open
issue remaining, as described below.

- The regression tests did not pass because expected/rules.out was not
properly updated.  I fixed that.

- The documentation did not build because some tags were not properly
terminated e.g.  rather than .  I also fixed that.

- The documentation had two nearly-identical lists of phases.  I
merged them into one.  There might be room for some further
fine-tuning here.

- cluster_rel() had multiple places where it could return without
calling pgstat_progress_end_command().  I fixed that.

- cluster_rel() inexplicably delayed updating PROGRESS_CLUSTER_COMMAND
for longer than seems necessary.  I fixed that.

- copy_heap_data() zeroed out the heap-tuples-scanned,
heap-blocks-scanned, and total-heap-blocks counters when it began
PROGRESS_CLUSTER_PHASE_SORT_TUPLES and
PROGRESS_CLUSTER_PHASE_SWAP_REL_FILES.  This seems like throwing away
useful information for no good reason.  I changed it not to do that in
all cases except the one mentioned in the next paragraph.

- It *is* currently necessary to reset PROGRESS_CLUSTER_HEAP_TUPLES_SCANNED
because that counter gets reused to indicate the number of heap tuples
*written back out*, but I think that is bad design for three reasons.
First, the documentation does not explain that sometimes the number of
heap tuples scanned is really reporting the number of heap tuples
written.  Second, it's bad for columns to have misleading names.
Third, it would actually be really useful to store these values in
separate columns, because then you could expect that the number of tuples
written would eventually equal the number scanned, and you'd still
have the number that were scanned around so that you could clearly see
how close you were getting to rewriting the entire heap.  This is the
one thing I found but did not fix; any chance you could make this
change and update the documentation to match?

- The comment about reporting that we are now reindexing relations was
jammed in between an existing comment and the associated code.  I
moved it to a more logical place.

- The new if-statement in pg_stat_get_progress_info was missing a
space required by project style.  I added the space.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


progress_monitor_for_cluster_command_v11.patch
Description: Binary data


Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-03-19 Thread Robert Haas
On Tue, Mar 19, 2019 at 1:24 PM Jim Finnerty  wrote:
> The queryId depends on oids, so it is not stable enough for some purposes.
> For example, to create a SQL identifier that survives across a server
> upgrade, or that can be shipped to another database, the queryId isn't
> usable.
>
> The apg_plan_mgmt extensions keeps both its own stable SQL identifier as
> well as the queryId, so it can be used to join to pg_stat_statements if
> desired.  If we were to standardize on one SQL identifier, it should be
> stable enough to survive a major version upgrade or to be the same in
> different databases.

If Amazon would like to open-source its (AIUI) proprietary technology
for computing query IDs and propose it for inclusion in PostgreSQL,
cool, but I think that is a separate question from whether people
would like more convenient access to the query ID technology that we
have today.  I think it's 100% clear that they would like that, even
as things stand, and therefore it does not make sense to block that
behind Amazon deciding to share what it already has or somebody else
trying to reimplement it.

If we need to have a space for both a core-standard query ID and
another query ID that is available for extension use, adding one more
field to struct Query, so we can have both coreQueryId and
extensionQueryId or whatever, would be easy to do.  It appears that
there's more use case than I would have guessed for custom query IDs.
On the other hand, it also appears that a lot of people would be very,
very happy to just be able to see the query ID field that already
exists, both in pg_stat_statements and in pg_stat_activity, and we
shouldn't throw up unnecessary impediments in the way of making that
happen, at least IMHO.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Willing to fix a PQexec() in libpq module

2019-03-19 Thread Daniel Verite
Tom Lane wrote:

> Unfortunately, if the default behavior doesn't change, then there's little
> argument for doing this at all.  The security reasoning behind doing
> anything in this area would be to provide an extra measure of protection
> against SQL-injection attacks on carelessly-written clients, and of course
> it's unlikely that a carelessly-written client would get changed to make
> use of a non-default feature.

A patch introducing an "allow_multiple_queries" GUC to
control this was proposed and eventually rejected for lack of
consensus some time ago (there were also some concerns about
the implementation that may have played against it):

https://www.postgresql.org/message-id/CALAY4q_eHUx%3D3p1QUOvabibwBvxEWGm-bzORrHA-itB7MBtd5Q%40mail.gmail.com

Regarding the effectiveness of this feature, there's a valid use case in
which it's not the developers who decide to set this GUC, but the DBA
or the organization deploying the application. That applies to
applications that, of course, do not intentionally use multiple queries
per command.
Since it would be optional, it would provide a certain level of protection
against SQL injections without changing the application or libpq or
breaking backward compatibility.

But both in this thread and the other thread, the reasoning about the
GUC seems to rest on the premise that applications would
need to be updated or developers would need to be aware of it,
as if they _had_ to issue SET allow_multiple_queries TO off/on,
rather than being subject to it, as imposed upon them by
postgresql.conf or the database settings.

If we compare this to, say, lo_compat_privileges. An application
typically doesn't get to decide whether it's "on". It's for a
superuser to decide which databases or which users must operate with
this setting to "on".
Why wouldn't that model work for disallowing multiple queries per command?


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: Online verification of checksums

2019-03-19 Thread Stephen Frost
Greetings,

On Tue, Mar 19, 2019 at 23:59 Andres Freund  wrote:

> Hi,
>
> On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> > > It's torn pages that I am concerned about - the server is writing and
> > > we are reading, and we get a mix of old and new content.  We have been
> > > quite diligent about protecting ourselves from such risks elsewhere,
> > > and checksum verification should not be held to any lesser standard.
> >
> > If we see a checksum failure on an otherwise correctly read block in
> > online mode, we retry the block on the theory that we might have read a
> > torn page.  If the checksum verification still fails, we compare its LSN
> > to the LSN of the current checkpoint and don't mind if its newer.  This
> > way, a torn page should not cause a false positive either way I
> > think?.
>
> False positives, no. But there's plenty potential for false
> negatives. In plenty clusters a large fraction of the pages is going to
> be touched in most checkpoints.


How is it a false negative?  The page was in the middle of being written.
If we crash, the page won’t be used because it’ll be replayed over from the
checkpoint; if we don’t crash, then it also won’t be used until it’s been
written out completely.  I don’t agree that this is in any way a false
negative -- it’s simply a page that happens to be in the middle of a file
that we can skip because it isn’t going to be used.  It’s not like there’s
going to be a checksum failure if the backend reads it.

Not only that, but checksum failures and the like are much more likely to
happen on long-dormant data, not on data that’s actively being written out
and therefore is still in the Linux FS cache and hasn’t even hit actual
storage yet anyway.

>  If it is a genuine storage failure we will see it in the next
> > pg_checksums run as its LSN will be older than the checkpoint.
>
> Well, but also, by that time it might be too late to recover things. Or
> it might be a backup that you just made, that you later want to recover
> from, ...


If it’s a backup you just made then that page is going to be in the WAL and
the torn page on disk isn’t going to be used, so how is this an issue?
This is why we have WAL -- to deal with torn pages.

> The basebackup checksum verification works in the same way.
>
> Shouldn't have been merged that way.


I have a hard time not finding this offensive.  These issues were
considered, discussed, and well thought out, with the result being
committed after agreement.

Do you have any example cases where the code in pg_basebackup has resulted
in either a false positive or a false negative?  Any case which can be
shown to result in either?

If not then I think we need to stop this, because if we can’t trust that a
torn page won’t actually be used in that torn state, then it seems likely
that our entire WAL system is broken, and we can’t trust the way we do
backups either and would have to rewrite all of that to take precautions to
lock pages while doing a backup.

Thanks!

Stephen



Re: Rare SSL failures on eelpout

2019-03-19 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Mar 19, 2019 at 9:11 AM Tom Lane  wrote:
>>> One thing that isn't real clear to me is how much timing sensitivity
>>> there is in "when no more input data is available".  Can we assume that
>>> if we've gotten ECONNRESET or an allied error from a write, then any
>>> data the far end might've wished to send us is already available to
>>> read?

> Following a trail beginning at
> https://en.wikipedia.org/wiki/Transmission_Control_Protocol I see that
> RFC 1122 4.2.2.13 discusses this topic and possible variations in this
> area.  I don't know enough about any of this stuff to draw hard
> conclusions from primary sources, but it does appear that an
> implementation might be within its rights to jettison that data,
> unfortunately.

I spent some time looking at that, and as far as I can see, the
behavior reported for Windows is flat out forbidden by TCP.  RFC1122
does discuss the possibility that an O/S might not support half-closed
connections; but that would only matter if (in Unix terms) we issued
shutdown(sock, SHUT_WR) and expected to continue to be able to read,
which we do not.  In any other scenario, TCP is supposed to deliver
any data it successfully received.

We can't, of course, work magic and retrieve data the TCP stack never
got, nor is there much we can do about it if the stack throws away
data despite the spec.  So I'm not inclined to fuss about the corner
cases.  The main thing for us is to ensure that if a server error
message is available to read, we do so and return it rather than
returning a less-helpful bleat about being unable to write.  The
proposed patch does that.  It will also tend to report bleats about
read failure rather than write failure, even if from libpq's perspective
the write failure happened first; but that seems fine to me.

>>> The reason I'm concerned is that I don't think it'd be bright to ignore a
>>> send error until we see input EOF, which'd be the obvious way to solve a
>>> timing problem if there is one.  If our send error was some transient
>>> glitch and the connection is really still open, then we might not get EOF,
>>> but we won't get a server reply either because no message went to the
>>> server.  You could imagine waiting some small interval of time before
>>> deciding that it's time to report the write failure, but ugh.

I'm likewise inclined to dismiss this worry, because I don't see
how it could happen, given that the server doesn't use shutdown(2).
A send error should imply either that the kernel saw RST from the
remote end, or that the connection is local and the kernel knows that
the server closed its socket or crashed.  In either scenario the
kernel should consider that the incoming data stream is ended; maybe
it'll give us whatever data it received or maybe not, but it shouldn't
allow us to sit and wait for more data.

Or in other words: there are no client-visible "transient glitches" in
TCP.  Either the connection is still up, or it's been reset.

So I'm inclined to (1) commit the patch as-proposed in HEAD, and
(2) hack the ssl test cases in v11 as you suggested.  If we see field
complaints about this, we can consider reverting (2) in favor of
a back-patch once v12 beta is over.

regards, tom lane



Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-03-19 Thread legrand legrand
Great,
thank you Julien!

Would it make sense to add it to auto_explain?
I'm not sure about EXPLAIN itself, but maybe ...

Regards
PAscal




--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



Re: Rare SSL failures on eelpout

2019-03-19 Thread Thomas Munro
On Wed, Mar 20, 2019 at 8:31 AM Tom Lane  wrote:
> So I'm inclined to (1) commit the patch as-proposed in HEAD, and
> (2) hack the ssl test cases in v11 as you suggested.  If we see field
> complaints about this, we can consider reverting (2) in favor of
> a back-patch once v12 beta is over.

This plan looks good to me.

-- 
Thomas Munro
https://enterprisedb.com



Re: Online verification of checksums

2019-03-19 Thread Andres Freund
Hi,

On 2019-03-20 03:27:55 +0800, Stephen Frost wrote:
> On Tue, Mar 19, 2019 at 23:59 Andres Freund  wrote:
> > On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > > Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> > > > It's torn pages that I am concerned about - the server is writing and
> > > > we are reading, and we get a mix of old and new content.  We have been
> > > > quite diligent about protecting ourselves from such risks elsewhere,
> > > > and checksum verification should not be held to any lesser standard.
> > >
> > > If we see a checksum failure on an otherwise correctly read block in
> > > online mode, we retry the block on the theory that we might have read a
> > > torn page.  If the checksum verification still fails, we compare its LSN
> > > to the LSN of the current checkpoint and don't mind if its newer.  This
> > > way, a torn page should not cause a false positive either way I
> > > think?.
> >
> > False positives, no. But there's plenty potential for false
> > negatives. In plenty clusters a large fraction of the pages is going to
> > be touched in most checkpoints.
> 
> 
> How is it a false negative?  The page was in the middle of being
> written,

You don't actually know that. It could just be random gunk in the LSN,
and this type of logic just ignores such failures as long as the random
gunk is above the system's LSN.

And the basebackup logic doesn't only skip verification when both the checksum
failed and the lsn is between startptr and the current insertion pointer --
it skips it for *any* page that has a pd_upper != 0 and a pd_lsn >
startptr.  Given typical startlsn values (skewing heavily towards lower
int64s), that means that random data is more likely than not to pass
this test.

As it stands, the logic seems to give more false confidence than
anything else.


> > The basebackup checksum verification works in the same way.
> >
> > Shouldn't have been merged that way.
> 
> 
> I have a hard time not finding this offensive.  These issues were
> considered, discussed, and well thought out, with the result being
> committed after agreement.

Well, I don't know what to tell you. But:

/*
 * Only check pages which have not been modified since the
 * start of the base backup. Otherwise, they might have been
 * written only halfway and the checksum would not be valid.
 * However, replaying WAL would reinstate the correct page in
 * this case. We also skip completely new pages, since they
 * don't have a checksum yet.
 */
if (!PageIsNew(page) && PageGetLSN(page) < startptr)
{

doesn't consider plenty of scenarios, as pointed out above.  It'd be one
thing if the concerns I point out above had actually been commented upon and
weighed as not substantial enough (not that I know how). But...




> Do you have any example cases where the code in pg_basebackup has resulted
> in either a false positive or a false negative?  Any case which can be
> shown to result in either?

CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 100) g(i);
SELECT pg_relation_size('corruptme');
postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || pg_relation_filepath('corruptme');
┌─────────────────────────────────────┐
│              ?column?               │
├─────────────────────────────────────┤
│ /srv/dev/pgdev-dev/base/13390/16384 │
└─────────────────────────────────────┘
(1 row)
dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc

Try a basebackup and see how many times it'll detect the corrupt
data. In the vast majority of cases you're going to see checksum
failures when reading the data for normal operation, but not when using
basebackup (or this new tool).

At the very very least this would need to do

a) checks that the page is all zeroes if PageIsNew() (like
   PageIsVerified() does for the backend). That avoids missing cases
   where corruption just zeroed out the header, but not the whole page.
b) Check that pd_lsn is between startlsn and the insertion pointer. That
   avoids accepting just about all random data.

And that'd *still* be less strenuous than what normal backends
check. And that's already not great (due to not noticing zeroed out
data).
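
In code terms, a rough sketch of what a stricter test could look like -- this
is not a patch; Page, PageIsNew(), PageGetLSN() and XLogRecPtr are the real
backend types/macros, but page_is_all_zeroes() is a hypothetical helper doing
the zero check that PageIsVerified() does, and the caller is assumed to pass
in the current insert position:

static bool
skip_checksum_verification(Page page, XLogRecPtr startptr, XLogRecPtr insertptr)
{
	/* a "new" page only gets a pass if it really is all zeroes */
	if (PageIsNew(page))
		return page_is_all_zeroes(page);

	/* otherwise, only trust pd_lsn if it falls inside the backup window */
	return PageGetLSN(page) >= startptr && PageGetLSN(page) <= insertptr;
}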

I fail to see how it's offensive to describe this as "shouldn't have
been merged that way".

Greetings,

Andres Freund



Re: Online verification of checksums

2019-03-19 Thread Andres Freund
On 2019-03-19 13:00:50 -0700, Andres Freund wrote:
> As it stands, the logic seems to give more false confidence than
> anything else.

To demonstrate that, I ran a loop that verified that a) a normal backend
query using the table detects the corruption, and b) pg_basebackup doesn't.

i=0;
while true; do
i=$(($i+1));
echo attempt $i;
dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 conv=notrunc 2>/dev/null;
psql -X -c 'SELECT * FROM corruptme;' 2>/dev/null && break;
~/build/postgres/dev-assert/vpath/src/bin/pg_basebackup/pg_basebackup -X fetch -F t -D - -c fast > /dev/null || break;
done

(excuse the crappy one-off sh)

had, during ~12k iterations, always detected the corruption in the
backend, and never via pg_basebackup. Given the likely LSNs in a
cluster, that's not too surprising.

Greetings,

Andres Freund



[survey] New "Stable" QueryId based on normalized query text

2019-03-19 Thread legrand legrand
Hello,

There are many projects that use an alternate QueryId,
distinct from the famous pg_stat_statements jumbling algorithm.

https://github.com/postgrespro/aqo
query_hash

https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Optimize.ViewPlans.html
sql_hash

https://github.com/ossc-db/pg_hint_plan
queryid

Even pg_stat_statements has a normalize function
that would answer the current question ...

Here are some *needs* :

needs.1: stable accross different databases,
needs.2: doesn't change after database or object rebuild,
needs.3: search_path / schema independent,
needs.4: pg version independent (as long as possible),
...

and some *normalization rules* (see the toy sketch after this list):

norm.1: case insensitive
norm.2: blank reduction 
norm.3: hash algorithm?
norm.4: CURRENT_DATE, CURRENT_TIME, LOCALTIME, LOCALTIMESTAMP not normalized
norm.5: NULL, IS NULL not normalized ?
norm.6: booleans t, f, true, false not normalized
norm.7: order by 1,2 or group by 1,2 should not be normalized
norm.8: pl/pgsql anonymous blocks not normalized
norm.9: comments aware
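
As a toy illustration of norm.1, norm.2 and norm.3 only -- not a proposal;
literals, comments, "order by 1,2" and all the other rules above are
ignored, and the FNV-1a constants merely stand in for whatever hash
algorithm would actually be chosen:

#include <ctype.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t
normalized_query_id(const char *sql)
{
	uint64_t	hash = 14695981039346656037ULL;	/* FNV-1a offset basis */
	int			last_was_space = 1;

	for (const char *p = sql; *p; p++)
	{
		unsigned char c = (unsigned char) *p;

		if (isspace(c))
		{
			if (last_was_space)
				continue;		/* blank reduction (norm.2) */
			c = ' ';
			last_was_space = 1;
		}
		else
		{
			c = (unsigned char) tolower(c);	/* case insensitive (norm.1) */
			last_was_space = 0;
		}

		hash ^= c;
		hash *= 1099511628211ULL;	/* FNV-1a prime (norm.3) */
	}
	return hash;
}

int
main(void)
{
	/* both print the same id: the texts differ only in case and whitespace */
	printf("%016llx\n", (unsigned long long) normalized_query_id("SELECT  *\nFROM t WHERE id = $1"));
	printf("%016llx\n", (unsigned long long) normalized_query_id("select * from t where id = $1"));
	return 0;
}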

Do not hesitate to add your thoughts
Regards
PAscal



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



Re: Online verification of checksums

2019-03-19 Thread Robert Haas
On Tue, Mar 19, 2019 at 4:49 PM Andres Freund  wrote:
> To demonstrate that I ran a loop that verified that a) a normal backend
> query using the tale detects the corruption b) pg_basebackup doesn't.
>
> i=0;
> while true; do
> i=$(($i+1));
> echo attempt $i;
> dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 
> conv=notrunc 2>/dev/null;
> psql -X -c 'SELECT * FROM corruptme;' 2>/dev/null && break;
> ~/build/postgres/dev-assert/vpath/src/bin/pg_basebackup/pg_basebackup -X 
> fetch -F t -D - -c fast > /dev/null || break;
> done
>
> (excuse the crappy one-off sh)
>
> had, during ~12k iterations, always detected the corruption in the
> backend, and never via pg_basebackup. Given the likely LSNs in a
> cluster, that's not too surprising.

Wow.  So we shipped a checksum-verification feature (in pg_basebackup)
that reliably fails to detect blatantly corrupt pages.  That's pretty
awful.  Your chances get better the more WAL you've ever generated,
but you have to generate 163 petabytes of WAL to have a 1% chance of
detecting a page of random garbage, so realistically they never get
very good.
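
For the record, the arithmetic behind that number (assuming a page of random
garbage carries an effectively uniform 64-bit pd_lsn, which is only verified
-- and hence caught -- when that value falls below the backup start pointer):

    P(\text{caught}) = \frac{\text{startptr}}{2^{64}}, \qquad
    0.01 \times 2^{64} \approx 1.8 \times 10^{17}\ \text{bytes} \approx 163\ \text{PiB}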

It's probably fair to point out that flipping a couple of random bytes
on the page is a more likely error than replacing the entire page with
garbage, and the check as designed will detect that fairly reliably --
unless those bytes are very near the beginning of the page.  Still,
that leaves a lot of kinds of corruption that this will not catch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Online verification of checksums

2019-03-19 Thread Michael Banck
Hi,

Am Dienstag, den 19.03.2019, 13:00 -0700 schrieb Andres Freund:
> On 2019-03-20 03:27:55 +0800, Stephen Frost wrote:
> > On Tue, Mar 19, 2019 at 23:59 Andres Freund  wrote:
> > > On 2019-03-19 16:52:08 +0100, Michael Banck wrote:
> > > > Am Dienstag, den 19.03.2019, 11:22 -0400 schrieb Robert Haas:
> > > > > It's torn pages that I am concerned about - the server is writing and
> > > > > we are reading, and we get a mix of old and new content.  We have been
> > > > > quite diligent about protecting ourselves from such risks elsewhere,
> > > > > and checksum verification should not be held to any lesser standard.
> > > > 
> > > > If we see a checksum failure on an otherwise correctly read block in
> > > > online mode, we retry the block on the theory that we might have read a
> > > > torn page.  If the checksum verification still fails, we compare its LSN
> > > > to the LSN of the current checkpoint and don't mind if its newer.  This
> > > > way, a torn page should not cause a false positive either way I
> > > > think?.
> > > 
> > > False positives, no. But there's plenty potential for false
> > > negatives. In plenty clusters a large fraction of the pages is going to
> > > be touched in most checkpoints.
> > 
> > 
> > How is it a false negative?  The page was in the middle of being
> > written,
> 
> You don't actually know that. It could just be random gunk in the LSN,
> and this type of logic just ignores such failures as long as the random
> gunk is above the system's LSN.

Right, I think this needs to be taken into account. For pg_basebackup,
that'd be an additional check against GetRedoRecPtr() or something similar
in the check below:

[...]

> Well, I don't know what to tell you. But:
> 
>   /*
>* Only check pages which have not been 
> modified since the
>* start of the base backup. Otherwise, they 
> might have been
>* written only halfway and the checksum would 
> not be valid.
>* However, replaying WAL would reinstate the 
> correct page in
>* this case. We also skip completely new 
> pages, since they
>* don't have a checksum yet.
>*/
>   if (!PageIsNew(page) && PageGetLSN(page) < 
> startptr)
>   {
> 
> doesn't consider plenty scenarios, as pointed out above.  It'd be one
> thing if the concerns I point out above were actually commented upon and
> weighed not substantial enough (not that I know how). But...
> 

> > Do you have any example cases where the code in pg_basebackup has resulted
> > in either a false positive or a false negative?  Any case which can be
> > shown to result in either?
> 
> CREATE TABLE corruptme AS SELECT g.i::text AS data FROM generate_series(1, 
> 100) g(i);
> SELECT pg_relation_size('corruptme');
> postgres[22890][1]=# SELECT current_setting('data_directory') || '/' || 
> pg_relation_filepath('corruptme');
> ┌─┐
> │  ?column?   │
> ├─┤
> │ /srv/dev/pgdev-dev/base/13390/16384 │
> └─┘
> (1 row)
> dd if=/dev/urandom of=/srv/dev/pgdev-dev/base/13390/16384 bs=8192 count=1 
> conv=notrunc
> 
> Try a basebackup and see how many times it'll detect the corrupt
> data. In the vast majority of cases you're going to see checksum
> failures when reading the data for normal operation, but not when using
> basebackup (or this new tool).

Right, see above.

> At the very very least this would need to do
> 
> a) checks that the page is all zeroes if PageIsNew() (like
>PageIsVerified() does for the backend). That avoids missing cases
>where corruption just zeroed out the header, but not the whole page.

We can't run pg_checksum_page() on those afterwards though as it would
fire an assertion:

|pg_checksums: [...]/../src/include/storage/checksum_impl.h:194:
|pg_checksum_page: Assertion `!(((PageHeader) (&cpage->phdr))->pd_upper
|== 0)' failed.

But we should count it as a checksum error and generate an appropriate
error message in that case.

> b) Check that pd_lsn is between startlsn and the insertion pointer. That
>avoids accepting just about all random data.

However, since pg_checksums is a stand-alone application, it can't just
access the insertion pointer, can it? We could maybe set a threshold
relative to the last checkpoint after which we consider the pd_lsn bogus. But
what's a good threshold here?
 
And/or we could port the other sanity checks from PageIsVerified:

|if ((p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
|   p->pd_lower <= p->pd_upper &&
|   p->pd_upper <= p->pd_special &&
|   p->pd_special <= BLCKSZ &&
|   p->pd_special == MAXALIGN(p->pd_special))
