RE: libpq debug log

2019-03-04 Thread Iwata, Aya
Hi,

Thank you for your comments and advice.

I'd like to consider your suggestions.
I am planning to change libpq logging like this;

1. Expand PQtrace() facility and improve libpq logging.

2. Prepare "output level". There are 3 types of levels;
- TRADITIONAL   :  if set, outputs protocol messages
- LEVEL1        :  if set, outputs phase and time
- LEVEL2        :  if set, outputs both TRADITIONAL and LEVEL1 information

3. Add "output phase" information; this is the processing flow (e.g. when 
PQexec() starts and ends).

4. Change the output method to a callback style; the default callback writes to 
stdout, and other callback functions that will be used more frequently will 
also be prepared.

5. Initialization method; 
Current:  PQtrace(PGconn *conn, FILE *stream);
Proposed: PQtraceEx(PGconn *conn, FILE *stream, PQloggingProcessor callback_func,
          void *callback_arg, PGLogminlevel level);
PQtrace() can be used as it is to preserve compatibility with existing 
applications, so I leave PQtrace() in place and add a new function PQtraceEx(). 
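
For reference, here is a minimal sketch of how the existing PQtrace() facility 
is used today, for contrast with the proposed PQtraceEx(); the connection 
string and trace file name are just placeholders.

    /* Uses only the existing libpq tracing API (PQtrace/PQuntrace). */
    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn *conn = PQconnectdb("dbname=postgres");
        FILE   *trace = fopen("libpq_trace.log", "w");

        if (PQstatus(conn) == CONNECTION_OK && trace != NULL)
        {
            PQtrace(conn, trace);               /* protocol dump goes to the FILE */
            PQclear(PQexec(conn, "SELECT 1"));  /* traffic for this query is traced */
            PQuntrace(conn);
        }

        if (trace != NULL)
            fclose(trace);
        PQfinish(conn);
        return 0;
    }

The proposal above keeps this call working unchanged and adds PQtraceEx() with 
the callback, callback_arg and level arguments on top of it.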

After discussing the above, maybe we can then discuss more about enabling trace 
output and changing the output style.

What do you think? I would appreciate your comments and suggestions.
 
Regards,
Aya Iwata




Re: Minimal logical decoding on standbys

2019-03-04 Thread Amit Khandekar
On Fri, 14 Dec 2018 at 06:25, Andres Freund  wrote:
> I've a prototype attached, but let's discuss the details in a separate
> thread. This also needs to be changed for pluggable storage, as we don't
> know about table access methods in the startup process, so we can't
> determine which AM the heap is from during
> btree_xlog_delete_get_latestRemovedXid() (and sibling routines).

Attached is a WIP test patch
0003-WIP-TAP-test-for-logical-decoding-on-standby.patch that has a
modified version of Craig Ringer's test cases
(012_logical_decoding_on_replica.pl) that he had attached in [1].
Here, I have also attached his original file
(Craigs_012_logical_decoding_on_replica.pl).

Also attached are rebased versions of a couple of Andres's implementation patches.

I have added a new test scenario:
DROP TABLE on the master *before* the logical records of the table
insertions are retrieved from the standby. The logical records should
still be retrieved successfully.


Regarding the test result failures, I could see that when we drop a
logical replication slot on the standby server, the catalog_xmin of the
physical replication slot becomes NULL, whereas the test expects it to
be equal to xmin; and that's the reason a couple of test scenarios are
failing:

ok 33 - slot on standby dropped manually
Waiting for replication conn replica's replay_lsn to pass '0/31273E0' on master
done
not ok 34 - physical catalog_xmin still non-null
not ok 35 - xmin and catalog_xmin equal after slot drop
#   Failed test 'xmin and catalog_xmin equal after slot drop'
#   at t/016_logical_decoding_on_replica.pl line 272.
#  got:
# expected: 2584



Other than the above, there is this test scenario which I had to remove:

#
# Conflict with recovery: xmin cancels decoding session
#
#
# Start a transaction on the replica then perform work that should cause a
# recovery conflict with it. We'll check to make sure the client gets
# terminated with recovery conflict.
#
# Temporarily disable hs feedback so we can test recovery conflicts.
# It's fine to continue using a physical slot, the xmin should be
# cleared. We only check hot_standby_feedback when establishing
# a new decoding session so this approach circumvents the safeguards
# in place and forces a conflict.

This test starts pg_recvlogical, and expects it to be terminated due
to recovery conflict because hs feedback is disabled.
But that does not happen; instead, pg_recvlogical does not return.

But I am not sure why it does not terminate with Andres's patch; it
was expected to terminate with Craig Ringer's patch.

Further, there are subsequent test scenarios that test pg_recvlogical
with hs_feedback disabled, which I have removed because pg_recvlogical
does not return. I am yet to clearly understand why that happens. I
suspect that is only because hs_feedback is disabled.

Also, the testcases verify pg_controldata's oldestCatalogXmin values,
which are now not present with Andres's patch; so I removed tracking
of oldestCatalogXmin.

[1] 
https://www.postgresql.org/message-id/CAMsr+YEVmBJ=dyLw=+kTihmUnGy5_EW4Mig5T0maieg_Zu=x...@mail.gmail.com

Thanks
-Amit Khandekar


0001-Logical-decoding-on-standby_rebased.patch
Description: Binary data
#!/usr/bin/env perl
# Demonstrate that logical decoding can follow timeline switches.
#
# Test logical decoding on a standby.
#
use strict;
use warnings;
use 5.8.0;

use PostgresNode;
use TestLib;
use Test::More tests => 77;
use RecursiveCopy;
use File::Copy;

my ($stdin, $stdout, $stderr, $ret, $handle, $return);
my $backup_name;

# Initialize master node
my $node_master = get_new_node('master');
$node_master->init(allows_streaming => 1, has_archiving => 1);
$node_master->append_conf('postgresql.conf', q{
wal_level = 'logical'
max_replication_slots = 4
max_wal_senders = 4
log_min_messages = 'debug2'
log_error_verbosity = verbose
# send status rapidly so we promptly advance xmin on master
wal_receiver_status_interval = 1
# very promptly terminate conflicting backends
max_standby_streaming_delay = '2s'
});
$node_master->dump_info;
$node_master->start;

$node_master->psql('postgres', q[CREATE DATABASE testdb]);

$node_master->safe_psql('testdb', q[SELECT * FROM pg_create_physical_replication_slot('decoding_standby');]);
$backup_name = 'b1';
my $backup_dir = $node_master->backup_dir . "/" . $backup_name;
TestLib::system_or_bail('pg_basebackup', '-D', $backup_dir, '-d', $node_master->connstr('testdb'), '--write-recovery-conf', '--slot=decoding_standby');

open(my $fh, "<", $backup_dir . "/recovery.conf")
  or die "can't open recovery.conf";

my $found = 0;
while (my $line = <$fh>)
{
	chomp($line);
	if ($line eq "primary_slot_name = 'decoding_standby'")
	{
		$found = 1;
		last;
	}
}
ok($found, "using physical slot for standby");

sub print_phys_xmin
{
	my $slot = $node_master->slot('decoding_standby');
	return ($slot->{'xmin'}, $slot->{'catalog_xmin'});

RE: pgbench - doCustom cleanup

2019-03-04 Thread Fabien COELHO



Hello Kirk,


so I tried to apply the patch, but part of the chunk failed,
because of the unused line below which was already removed in the
recent related commit.

  PGresult   *res;

I removed the line and fixed the other trailing whitespaces.


Indeed. Thanks for the fix.


See the attached latest patch.
The attached patch applies, builds cleanly,
and passes the regression tests.


[...]


No problems on my part as I find the changes logical.
This also needs a confirmation from Alvaro.


Ok.

You switched the patch to "waiting on author": what are you waiting for from 
me?


If you think that it is ok and that it should be considered by a 
committer, probably Alvaro, it can be marked as "ready".


--
Fabien.



Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid

2019-03-04 Thread Kyotaro HORIGUCHI
Hello.

At Sun, 2 Dec 2018 16:41:06 -0800, Noah Misch  wrote in 
<20181203004106.ga2860...@rfd.leadboat.com>
> On Thu, Nov 29, 2018 at 10:51:40PM -0800, Noah Misch wrote:
> > On Thu, Nov 29, 2018 at 01:10:57PM +0100, Dmitry Dolgov wrote:
> > > As a side note, with this patch recovery tests are failing now on 
> > > 016_shm.pl
> > > 
> > > #   Failed test 'detected live backend via shared memory'
> > > #   at t/016_shm.pl line 87.
> > > #   '2018-11-28 13:08:08.504 UTC [21924] LOG:
> > > listening on Unix socket "/tmp/yV2oDNcG8e/gnat/.s.PGSQL.512"
> > > # 2018-11-28 13:08:08.512 UTC [21925] LOG:  database system was
> > > interrupted; last known up at 2018-11-28 13:08:08 UTC
> > > # 2018-11-28 13:08:08.512 UTC [21925] LOG:  database system was not
> > > properly shut down; automatic recovery in progress
> > > # 2018-11-28 13:08:08.512 UTC [21925] LOG:  invalid record length at
> > > 0/165FEF8: wanted 24, got 0
> > > # 2018-11-28 13:08:08.512 UTC [21925] LOG:  redo is not required
> > > # 2018-11-28 13:08:08.516 UTC [21924] LOG:  database system is ready
> > > to accept connections
> > > # '
> > > # doesn't match '(?^:pre-existing shared memory block)'
> > 
> > Thanks for the report.  Since commit cfdf4dc made pg_sleep() react to
> > postmaster death, the test will need a different way to stall a backend.  
> > This
> > doesn't affect non-test code, and the test still passes against cfdf4dc^ and
> > against REL_11_STABLE.  I've queued a task to update the test code, but 
> > review
> > can proceed in parallel.

I found that I have 65(h) segments left behind on my environment :p

At Sat, 11 Aug 2018 23:48:15 -0700, Noah Misch  wrote in 
<20180812064815.gb2301...@rfd.leadboat.com>
> still doesn't detect.  I could pursue a fix via the aforementioned
> sysv_shmem_key file, modulo the possibility of a DBA removing it.  I could
> also, when postmaster.pid is missing, make sysv_shmem.c check the first N
> (N=100?) keys applicable to the selected port.  My gut feeling is that neither
> thing is worthwhile, but I'm interested to hear other opinions.

# Though I don't get the meaning of the "modulo" there..

I think the only thing we must avoid here is sharing the same
shmem segment with a living-dead server. If we can do that
without the pid file, it would be better than relying on it. We
could remove orphaned segments automatically, but I don't think
we should go so far as relying on a dedicated file for that.
Also, I don't think it's worth stopping the shmem id scanning
at a certain number, since I can't come up with an appropriate
number for it. But looking at "port * 1000", it might be expected
that a usable segment id will be found while scanning that number
of ids (1000).

> Here's a version fixing that test for post-cfdf4dc backends.

This moves what was in PGSharedMemoryIsInUse into a new function,
IpcMemoryAnalyze, for reuse, then adds a distinction between
EEXISTS/FOREIGN in the not-in-use cases and ATTACHED/ANALYSIS_FAILURE
in the in-use cases. UNATTACHED is changed to a not-in-use case by
this patch.  As a result, PGSharedMemoryIsInUse() is changed so
that it returns "not-in-use" for the UNATTACHED case. It looks
fine.

PGSharedMemoryCreate is changed to choose a usable shmem id using
IpcMemoryAnalyze().  But some of the statuses from
IpcMemoryAnalyze() are concealed by a failure of
PGSharedMemoryAttach() and silently ignored, as opposed to what the
code intends to do. (By the way, SHMSTATE_EEXISTS seems to suggest
the opposite of EEXIST, which could be confusing.)

PGSharedMemoryCreate() repeats shmat/shmdt twice in every
iteration. It does not do much harm, but it would be better if we
could get rid of that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: speeding up planning with partitions

2019-03-04 Thread Imai, Yoshikazu
Amit-san,

On Fri, Mar 1, 2019 at 1:02 PM, Amit Langote wrote:
> Thanks for testing and sorry it took me a while to reply.

Thanks for working on this late at night. I know you have a lot of things to 
do.

> On 2019/02/25 15:24, Imai, Yoshikazu wrote:
> > [update_pt_with_joining_another_pt.sql]
> > update rt set c = jrt.c + 100 from jrt where rt.b = jrt.b;
> >
> > [pgbench]
> > pgbench -n -f update_pt_with_joining_another_pt_for_ptkey.sql -T 60
> > postgres
> >
> > [results]
> > (part_num_rt, part_num_jrt)  master  patched(0001)
> > ---  --  -
> >   (3, 1024)8.06   5.89
> >   (3, 2048)1.52   0.87
> >   (6, 1024)4.11   1.77
> >
> >
> >
> > With HEAD, we add target inheritance and source inheritance to
> > parse->rtable in inheritance_planner and copy and adjust it for child
> > tables at beginning of each planning of child tables.
> >
> > With the 0001 patch, we add target inheritance to parse->rtable in
> > inheritance_planner and add source inheritance to parse->rtable in
> > make_one_rel (under grouping_planner()) during each planning of child
> > tables.
> > Adding source inheritance to parse->rtable may be the same process
> > between each planning of child tables and it might be useless. OTOH, from
> > the POV of making inheritance-expansion-at-bottom better, expanding
> > source inheritance in make_one_rel seems correct design to me.
> >
> > How should we do that...?
> 
> To solve this problem, I ended up teaching inheritance_planner to reuse
> the objects for *source* inheritance child relations (RTEs,
> AppendRelInfos, and PlanRowMarks) created during the planning of the 1st
> child query for the planning of subsequent child queries.  Set of source
> child relations don't change between different planning runs, so it's
> okay to do so.  Of course, I had to make sure that query_planner (which
> is not in the charge of adding source inheritance child objects) can notice
> that.

I did the above test again with the v25 patch and confirmed the problem is solved.

[results]
(part_num_rt, part_num_jrt)  master  patched(0001)
---  --  -
  (3, 1024)6.11   6.82
  (3, 2048)1.05   1.48
  (6, 1024)3.05   3.45

Sorry that I haven't checked the code related to this problem yet, but I'll 
check it later.


> Please find attached updated patches.  Will update source code comments,
> commit message and perform other fine-tuning over the weekend if possible.

I've taken a glance at the code and have some minor comments about the 
patch.

* You changed the usage/arguments of get_relation_info, but why did you do it? I 
reverted those changes to the original code and checked that it passes make check. So 
ISTM there are no logical problems with not changing it. Or if you do change it, 
how about also changing the usage/arguments of get_relation_info_hook in the same 
way?

-get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
- RelOptInfo *rel)
+get_relation_info(PlannerInfo *root, RangeTblEntry *rte, RelOptInfo *rel)
 {
+   bool        inhparent = rte->inh;
-   relation = table_open(relationObjectId, NoLock);
+   relation = heap_open(rte->relid, NoLock);
  ...
if (get_relation_info_hook)
-   (*get_relation_info_hook) (root, relationObjectId, inhparent, 
rel);
+   (*get_relation_info_hook) (root, rte->relid, rte->inh, rel);


@@ -217,15 +272,13 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo 
*parent)
/* Check type of rtable entry */
switch (rte->rtekind)
{
case RTE_RELATION:
/* Table --- retrieve statistics from the system 
catalogs */
-   get_relation_info(root, rte->relid, rte->inh, rel);
+   get_relation_info(root, rte, rel);


* You moved the code that initializes an append rel's partitioned_child_rels 
from set_append_rel_size() to build_simple_rel(), but is that better? I also 
checked that the original code passes make check, as above.

@@ -954,32 +948,6 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
Assert(IS_SIMPLE_REL(rel));
 
/*
-* Initialize partitioned_child_rels to contain this RT index.
-*
-* Note that during the set_append_rel_pathlist() phase, we will bubble 
up
-* the indexes of partitioned relations that appear down in the tree, so
-* that when we've created Paths for all the children, the root
-* partitioned table's list will contain all such indexes.
-*/
-   if (rte->relkind == RELKIND_PARTITIONED_TABLE)
-   rel->partitioned_child_rels = list_make1_int(rti);

@@ -274,55 +327,287 @@ build_simple_rel(PlannerInfo *root, int relid, 
RelOptInfo *parent)
   

Re: amcheck verification for GiST

2019-03-04 Thread Heikki Linnakangas

There's a little copy-pasto in gist_check_page_keys():


+   for (o = FirstOffsetNumber; o <= parent_maxoff; o = OffsetNumberNext(i))


Should be "OffsetNumberNext(o)".

I tested this patch with your testing patch from the other thread (after 
fixing the above), to leave behind incompletely split pages [1]. It 
seems that the amcheck code doesn't expect incomplete splits:


postgres=# SELECT gist_index_parent_check('x_c_idx');
ERROR:  index "x_c_idx" has inconsistent records

[1] 
https://www.postgresql.org/message-id/EB87A69B-EE5E-4259-9EEB-DA9DC1F7E265%40yandex-team.ru


- Heikki



Re: GiST VACUUM

2019-03-04 Thread Heikki Linnakangas

On 04/01/2019 21:26, Andrey Borodin wrote:

On 3 Jan 2019, at 23:47, Andrey Borodin wrote:



* Bitmapset stores 32 bit signed integers, but BlockNumber is
unsigned. So this will fail with an index larger than 2^31
blocks. That's perhaps academical, I don't think anyone will try
to create a 16 TB GiST index any time soon. But it feels wrong to
introduce an arbitrary limitation like that.


Looks like bitmapset is unsuitable for collecting block numbers due
to the interface. Let's just create a custom container for this? Or
there is one other option: for each block number, throw away the sign
bit and consider a page potentially internal and potentially an empty
leaf if (blkno & 0x7FF) is in the bitmapset. And then we will have
to create at least one 17Tb GiST to check it actually works.


Heikki, what do you think, is implementing our own radix tree for this
a viable solution? I've written a working implementation with a 4-level
statically typed tree. If we follow this route, there should probably
be tests for this data structure.


Yeah, seems reasonable.

- Heikki



Re: [HACKERS] CLUSTER command progress monitor

2019-03-04 Thread Tatsuro Yamada

On 2019/03/02 4:15, Robert Haas wrote:

On Thu, Feb 28, 2019 at 11:54 PM Tatsuro Yamada
 wrote:

Attached patch is wip patch.



Thanks for your comments! :)
I revised the code and the document.



+   CLUSTER and VACUUM FULL,
showing current progress.

and -> or


Fixed.



+   certain commands during command execution.  Currently, the suppoted
+   progress reporting commands are VACUUM and
CLUSTER.

suppoted -> supported

But I'd just say: Currently, the only commands which support progress
reporting are VACUUM and
CLUSTER.


I choose the latter. Fixed.



+   Running VACUUM FULL is listed in
pg_stat_progress_cluster
+   view because it uses CLUSTER command
internally.  See .

How about: Running VACUUM FULL is listed in
pg_stat_progress_cluster because both
VACUUM FULL and CLUSTER rewrite
the table, while regular VACUUM only modifies it in
place.


Fixed.



+   Current processing command: CLUSTER/VACUUM FULL.

The command that is running.  Either CLUSTER or VACUUM FULL.


Fixed.



+   Current processing phase of cluster/vacuum full.  See  or .

Current processing phase of CLUSTER or VACUUM FULL.

Or maybe better, just abbreviate to: Current processing phase.


Fixed as you suggested.



+   Scan method of table: index scan/seq scan.

Eh, shouldn't this be gone now?  And likewise for the view definition?


Fixed. Sorry, it was an oversight.



+   OID of the index.

If the table is being scanned using an index, this is the OID of the
index being used; otherwise, it is zero.


Fixed.



+ heap_tuples_total

Leftovers.  Skipping over the rest of your documentation changes since
it looks like a bunch of things there still need to be updated.


I agree. Thanks a lot!
I'll divide the patch into two patches: code and documentation.



+ pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);

This now appears inside cluster_rel(), but also vacuum_rel() is still
doing the same thing.  That's wrong.


It was an oversight too. I fixed it.



+ if(OidIsValid(indexOid))

Missing space.  Please pgindent.


Fixed.
I will do pgindent later.


Please find attached files. :)

Thanks,
Tatsuro Yamada

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 3e229c693c..0d0f8f0e31 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -906,6 +906,30 @@ CREATE VIEW pg_stat_progress_vacuum AS
 FROM pg_stat_get_progress_info('VACUUM') AS S
 		LEFT JOIN pg_database D ON S.datid = D.oid;
 
+CREATE VIEW pg_stat_progress_cluster AS
+SELECT
+S.pid AS pid,
+S.datid AS datid,
+D.datname AS datname,
+S.relid AS relid,
+CASE S.param1 WHEN 1 THEN 'CLUSTER'
+  WHEN 2 THEN 'VACUUM FULL'
+  END AS command,
+CASE S.param2 WHEN 0 THEN 'initializing'
+  WHEN 1 THEN 'seq scanning heap'
+  WHEN 2 THEN 'index scanning heap'
+  WHEN 3 THEN 'sorting tuples'
+  WHEN 4 THEN 'writing new heap'
+  WHEN 5 THEN 'swapping relation files'
+  WHEN 6 THEN 'rebuilding index'
+  WHEN 7 THEN 'performing final cleanup'
+  END AS phase,
+S.param3 AS cluster_index_relid,
+S.param4 AS heap_tuples_scanned,
+S.param5 AS heap_tuples_vacuumed
+FROM pg_stat_get_progress_info('CLUSTER') AS S
+LEFT JOIN pg_database D ON S.datid = D.oid;
+
 CREATE VIEW pg_user_mappings AS
 SELECT
 U.oid   AS umid,
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index a74af4c171..f22ff590f0 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -35,10 +35,12 @@
 #include "catalog/objectaccess.h"
 #include "catalog/toasting.h"
 #include "commands/cluster.h"
+#include "commands/progress.h"
 #include "commands/tablecmds.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "optimizer/optimizer.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
@@ -275,6 +277,8 @@ cluster_rel(Oid tableOid, Oid indexOid, int options)
 	/* Check for user-requested abort. */
 	CHECK_FOR_INTERRUPTS();
 
+	pgstat_progress_start_command(PROGRESS_COMMAND_CLUSTER, tableOid);
+
 	/*
 	 * We grab exclusive access to the target rel and index for the duration
 	 * of the transaction.  (This is redundant for the single-transaction
@@ -385,6 +389,18 @@ cluster_rel(Oid tableOid, Oid indexOid, int options)
 	 */
 	CheckTableNotInUse(OldHeap, OidIsValid(indexOid) ? "CLUSTER" : "VACUUM");
 
+	/* Set command to column */
+	if (OidIsValid(indexOid))
+	{
+		pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+	 PROGRESS_CLUSTER_COMMAND_CLUSTER);
+	}
+	else
+	{
+		pgstat_progress_update_param(PROGRESS_CLUSTER_COMMAND,
+	 PROGRESS_CLUSTER_COMMAND_VACUUM_FULL);
+	}
+
 	/*

Re: unconstify equivalent for volatile

2019-03-04 Thread Peter Eisentraut
On 2019-02-25 09:29, Peter Eisentraut wrote:
> On 2019-02-22 21:31, Andres Freund wrote:
>> On 2019-02-22 12:38:35 +0100, Peter Eisentraut wrote:
>>> On 2019-02-19 18:02, Andres Freund wrote:
 But even if we were to decide we'd want to keep a volatile in SetLatch()
 - which I think really would only serve to hide bugs - that'd not mean
 it's a good idea to keep it on all the other functions in latch.c.
> 
>> Right. But we should never look/write into the contents of a latch
>> outside of latch.c, so I don't think that'd really be a problem, even if
>> we relied on volatiles.
> 
> So how about this patch?

committed

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] CLUSTER command progress monitor

2019-03-04 Thread Tatsuro Yamada

On 2019/03/02 4:15, Robert Haas wrote:

On Thu, Feb 28, 2019 at 11:54 PM Tatsuro Yamada
 wrote:

Attached patch is wip patch.


I rewrote the current design of the progress monitor and also
wrote discussion points in the middle of this email. I'd like to
get any feedback from -hackers.


=== Current design ===

CLUSTER command uses Index Scan or Seq Scan when scanning the heap.
Depending on which one is chosen, the command will proceed in the
following sequence of phases:

  * Scan method: Seq Scan
0. initializing (*2)
1. seq scanning heap(*1)
3. sorting tuples   (*2)
4. writing new heap (*1)
5. swapping relation files  (*2)
6. rebuilding index (*2)
7. performing final cleanup (*2)

  * Scan method: Index Scan
0. initializing (*2)
2. index scanning heap  (*1)
5. swapping relation files  (*2)
6. rebuilding index (*2)
7. performing final cleanup (*2)

VACUUM FULL command will proceed in the following sequence of phases:

1. seq scanning heap(*1)
5. swapping relation files  (*2)
6. rebuilding index (*2)
7. performing final cleanup (*2)

(*1): increasing the value in heap_tuples_scanned column
(*2): only shows the phase in the phase column

The view provides the CLUSTER command progress details as follows:
# \d pg_stat_progress_cluster
  View "pg_catalog.pg_stat_progress_cluster"
  Column   |  Type   | Collation | Nullable | Default
---+-+---+--+-
 pid   | integer |   |  |
 datid | oid |   |  |
 datname   | name|   |  |
 relid | oid |   |  |
 command   | text|   |  |
 phase | text|   |  |
 cluster_index_relid   | bigint  |   |  |
 heap_tuples_scanned   | bigint  |   |  |
 heap_tuples_vacuumed  | bigint  |   |  |


=== Discussion points ===

 - Progress counter for "3. sorting tuples" phase
   - Should we add pgstat_progress_update_param() in tuplesort.c, like
     "trace_sort" does? (See the sketch after this list.)
     Thanks to Peter Geoghegan for the useful advice!

 - Progress counter for "6. rebuilding index" phase
   - Should we add "index_vacuum_count" to the view, like the vacuum progress
     monitor? If yes, I'll add pgstat_progress_update_param() to
     reindex_relation() in index.c. However, I'm not sure whether that is okay.

 - pg_stat_progress_rewrite
   - TBA
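
As a hypothetical sketch only, a counter for the sorting phase could look like 
the following. PROGRESS_CLUSTER_HEAP_TUPLES_SORTED and its index are assumptions 
(they are not in the patch); pgstat_progress_update_param() is the existing 
reporting call the other phases already use.

    #include "postgres.h"

    #include "commands/progress.h"
    #include "pgstat.h"

    /* assumed new progress-column index; would map to param6 of the view */
    #define PROGRESS_CLUSTER_HEAP_TUPLES_SORTED   5

    static void
    report_tuple_sorted(int64 tuples_sorted)
    {
        /* updates this backend's row in pg_stat_progress_cluster */
        pgstat_progress_update_param(PROGRESS_CLUSTER_HEAP_TUPLES_SORTED,
                                     tuples_sorted);
    }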


=== My test case ===

I share my test case for the progress monitor.
If someone wants to watch the current progress monitor, you can use
this test case as an example.

[Terminal1]
Run this query on psql:

   select * from pg_stat_progress_cluster; \watch 0.05

[Terminal2]
Run these queries on psql:

drop table t1;

create table t1 as select a, random() * 1000 as b from generate_series(0, 
99) a;
create index idx_t1 on t1(a);
create index idx_t1_b on t1(b);
analyze t1;

-- index scan
set enable_seqscan to off;
cluster verbose t1 using idx_t1;

-- seq scan
set enable_seqscan to on;
set enable_indexscan to off;
cluster verbose t1 using idx_t1;

-- only given table name to cluster command
cluster verbose t1;

-- only cluster command
cluster verbose;

-- vacuum full
vacuum full t1;

-- vacuum full
vacuum full;




Thanks,
Tatsuro Yamada




Re: speeding up planning with partitions

2019-03-04 Thread Amit Langote
Hi,

On 2019/03/01 22:01, Amit Langote wrote:
> Please find attached updated patches.  Will update source code comments,
> commit message and perform other fine-tuning over the weekend if possible.

I realized when "fine-tuning" that the patch 0001 contained too many
changes that seem logically separable.  I managed to divide it into the
following patches, which also amounts to a much shorter overall diff
against the master.  Also, the smaller individual patches made it easier
to spot a lot of useless diffs of inherit.c.

Attached patches are as follows:

1. Create the "otherrel" RelOptInfos of the appendrels as a separate step
of query_planner.  Newly added function add_other_rels_to_query() which
adds "other rel" RelOptInfos of child relations is called after
query_planner has finished distributing restrict clauses to baserels.
Child relations in this case include both those of flattened UNION ALL
subqueries and inheritance child tables.  Child RangeTblEntrys and
AppendRelInfos are added early in subquery_planner for both types of child
relations.  Of the two, we'd like to delay adding the inheritance
children, which is done in the next patch.

See patch 0001.

2. Defer inheritance expansion to add_other_rels_to_query().  Although the
purpose of doing this is to perform partition pruning before adding the
children, this patch doesn't change when the pruning occurs.  It deals
with other issues that must be taken care of due to adding children during
query_planner instead of during subquery_planner.  Especially,
inheritance_planner now has to add the child target relations on its own.
Also, delaying adding children also affects adding junk columns to the
query's targetlist based on PlanRowMarks, because preprocess_targetlist
can no longer finalize which junk columns to add for a "parent"
PlanRowMark; that must be delayed until all child PlanRowMarks are added
and their allMarkTypes propagated to the parent PlanRowMark.

See patch 0002.

3. Because inheritance_planner calls query_planner separately for each
target child relation and the patch 0002 above puts query_planner in
charge of inheritance expansion, that means child tables of source
inheritance sets will be added as many times as there are target children.
 This makes update/delete queries containing inherited source tables
somewhat slow.  This patch adjusts inheritance_planner to reuse source
inheritance children added during the planning of 1st child query for the
planning of subsequent child queries.

See patch 0003.

4. Now that all the issues arising due to late inheritance expansion have
been taken care of, this patch moves where partition pruning occurs.
Today it's in set_append_rel_size() and this patch moves it to
expand_partitioned_rtentry() where partitions are added.  Only the
partitions remaining after pruning are added, so some members of part_rels
can remain NULL.  Some of the places that access partition RelOptInfos
using that array needed to be made aware of that.

See patch 0004.

5. There are a few places which loop over *all* members of part_rels array
of a partitioned parent's RelOptInfo to do something with the partition
rels.   Some of those loops run even for point-lookup queries where only
one partition would be accessed, which is inefficient.  This patch adds a
Bitmapset member named 'live_parts' to RelOptInfo, whose value is the set
of indexes of unpruned partitions in the parent's RelOptInfo.  The
aforementioned loops are now replaced by looping over live_parts Bitmapset
instead.

See patch 0005.

6. set_relation_partition_info today copies the PartitionBoundInfo from
the relcache using partition_bounds_copy.  Doing partition_bounds_copy
gets expensive as the number of partitions increases and it really doesn't
seem necessary for the planner to create its own copy.  This patch removes
the partition_bounds_copy() and simply uses the relcache pointer.

See patch 0006.

Hope that the above division makes the changes easier to review.

Thanks,
Amit
From 1d83debe07040a93721f17785d075d26dab6f088 Mon Sep 17 00:00:00 2001
From: Amit 
Date: Sat, 2 Mar 2019 14:13:13 +0900
Subject: [PATCH v26 1/6] Build "other rels" of appendrel baserels in a
 separate step

add_other_rels_to_query() runs after query_planner has distributed
restrict clauses to base relations.  This will allow us to use
the clauses applied to a given partitioned baserel to perform partition
pruning, and add other rels for only the unpruned partitions.
---
 src/backend/optimizer/path/allpaths.c  |   2 +-
 src/backend/optimizer/plan/initsplan.c |  53 +--
 src/backend/optimizer/plan/planmain.c  |  15 +++--
 src/backend/optimizer/util/relnode.c   | 118 +
 src/include/optimizer/pathnode.h   |   2 +
 src/include/optimizer/planmain.h   |   1 +
 6 files changed, 140 insertions(+), 51 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c 
b/src/backend/optimizer/path/allpaths.c
index 0debac75c6..8d8a8f17d5 100644
---

Re: speeding up planning with partitions

2019-03-04 Thread Amit Langote
Imai-san,

Thanks for the review.

On 2019/03/04 18:14, Imai, Yoshikazu wrote:
> I've taken at glance the codes and there are some minor comments about the 
> patch.
> 
> * You changed the usage/arguments of get_relation_info, but why did you do it? 
> I reverted those changes to the original code and checked that it passes make 
> check. So ISTM there are no logical problems with not changing it. Or if you 
> do change it, how about also changing the usage/arguments of 
> get_relation_info_hook in the same way?
> 
> -get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
> -   RelOptInfo *rel)
> +get_relation_info(PlannerInfo *root, RangeTblEntry *rte, RelOptInfo *rel)
>  {
> +   bool        inhparent = rte->inh;
> - relation = table_open(relationObjectId, NoLock);
> + relation = heap_open(rte->relid, NoLock);
>   ...
>   if (get_relation_info_hook)
> - (*get_relation_info_hook) (root, relationObjectId, inhparent, 
> rel);
> + (*get_relation_info_hook) (root, rte->relid, rte->inh, rel);
> 
> 
> @@ -217,15 +272,13 @@ build_simple_rel(PlannerInfo *root, int relid, 
> RelOptInfo *parent)
>   /* Check type of rtable entry */
>   switch (rte->rtekind)
>   {
>   case RTE_RELATION:
>   /* Table --- retrieve statistics from the system 
> catalogs */
> - get_relation_info(root, rte->relid, rte->inh, rel);
> + get_relation_info(root, rte, rel);
> 
> 
> * You moved the code that initializes an append rel's partitioned_child_rels 
> from set_append_rel_size() to build_simple_rel(), but is that better? I also 
> checked that the original code passes make check, as above.
> 
> @@ -954,32 +948,6 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
>   Assert(IS_SIMPLE_REL(rel));
>  
>   /*
> -  * Initialize partitioned_child_rels to contain this RT index.
> -  *
> -  * Note that during the set_append_rel_pathlist() phase, we will bubble 
> up
> -  * the indexes of partitioned relations that appear down in the tree, so
> -  * that when we've created Paths for all the children, the root
> -  * partitioned table's list will contain all such indexes.
> -  */
> - if (rte->relkind == RELKIND_PARTITIONED_TABLE)
> - rel->partitioned_child_rels = list_make1_int(rti);
> 
> @@ -274,55 +327,287 @@ build_simple_rel(PlannerInfo *root, int relid, 
> RelOptInfo *parent)
>   
> list_length(rte->securityQuals));
>  
>   /*
> -  * If this rel is an appendrel parent, recurse to build "other rel"
> -  * RelOptInfos for its children.  They are "other rels" because they are
> -  * not in the main join tree, but we will need RelOptInfos to plan 
> access
> -  * to them.
> +  * Add the parent to partitioned_child_rels.
> +  *
> +  * Note that during the set_append_rel_pathlist() phase, values of the
> +  * indexes of partitioned relations that appear down in the tree will
> +  * be bubbled up into root parent's list so that when we've created
> +  * Paths for all the children, the root table's list will contain all
> +  * such indexes.
>*/
> - if (rte->inh)
> + if (rel->part_scheme)
> + rel->partitioned_child_rels = list_make1_int(relid);

Neither of these changes is present in the latest patches I posted,
where I also got rid of a lot of unnecessary diffs.

Thanks,
Amit




Re: libpq debug log

2019-03-04 Thread Kyotaro HORIGUCHI
Hello.

I came up with some random comments.

At Mon, 4 Mar 2019 08:13:00 +, "Iwata, Aya"  
wrote in <71E660EB361DF14299875B198D4CE5423DEF1844@g01jpexmbkw25>
> Hi,
> 
> Thank you for your comments and advice.
> 
> I'd like to consider your suggestions.
> I am planning to change libpq logging like this;
> 
> 1. Expand PQtrace() facility and improve libpq logging.
> 
> 2. Prepare "output level". There are 3 types of levels;
>   - TRADITIONAL   :  if set, outputs protocol messages
>   - LEVEL1        :  if set, outputs phase and time
>   - LEVEL2        :  if set, outputs both TRADITIONAL and LEVEL1 information

You appear to want to segregate the "traditional" output from
what you want to emit. At least Tom is explicitly suggesting
throwing away the hypothetical use cases for it. You may sort out
what kind of information you/we want to emit as log messages from
scratch :p

You may organize log kinds into hierarchical levels like server
log messages, or into orthogonal types that are individually
turned on. But it does not necessarily have to be a parameter of
a logging processor (mentioned below).

> 3. Add "output phase" information; this is the processing flow (e.g. when 
> PQexec() starts and ends).

What is the advantage of it against just two independent messages
like PQexec_start and PQexec_end? (I don't see any advantage.)

> 4. Change the output method to a callback style; the default callback writes 
> to stdout, and other callback functions that will be used more frequently 
> will also be prepared.

Are you going to make something less-used the default behavior? I
think no one is opposing rich functionality as long as it is
replaceable.

> 5. Initialization method;
> In current one: PQtrace(PGconn *conn, FILE *stream);
> Proposed change: PQtraceEx(PGconn *conn, FILE *stream, PQloggingProcessor 
> callback_func , void *callback_arg, PGLogminlevel level);

I'm not sure we should add a new *EX() function. Couldn't we
just change the signature of PQtrace()?

callback_func seems to be a single function. I think it's better
to have an individual function for each message type.  Not
callback_func(PQLOG_EXEC_START, param_1, param_2, ...), but
PQloggingProcessor.PQexec_start(param_1, param_2, ...).

That is because the caller can simply pass values of its own types
to the function without casts or other transformations, and the
types are checked statically.
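
A rough sketch of that struct-of-callbacks shape might look like the following; 
every type and field name here is hypothetical, meant only to illustrate how 
statically typed per-message callbacks could replace a single variadic 
callback_func().

    #include <stddef.h>

    typedef struct PQloggingProcessor
    {
        void   *callback_arg;   /* logger-specific state, e.g. a FILE * */

        /* one strongly typed hook per event, so no casting is needed */
        void    (*PQexec_start) (void *arg, const char *query);
        void    (*PQexec_end) (void *arg, double elapsed_ms);
        void    (*protocol_message) (void *arg, char msg_type,
                                     const char *data, size_t len);
    } PQloggingProcessor;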

I also think it's better that logger-specific parameters are not
passed at this level.  Maybe stream and level are logger-specific
parameters, which can be combined into callback_arg, or can be
given via an API function.

> PQtrace() can be used as it is to preserve compatibility with existing 
> applications, so I leave PQtrace() in place and add a new function 
> PQtraceEx(). 
> 
> After discussing the above, maybe we can then discuss more about enabling 
> trace output and changing the output style.

I'm not sure what you mean by "output style", but you can change
everything by replacing the whole callback processor, which may
be a dynamically loaded file which is loaded by an instruction in
both ~/.libpqrc and some API, like PQloadLoggingProcessor()?


> What do you think? I would appreciate your comments and suggestions.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: Minimal logical decoding on standbys

2019-03-04 Thread tushar

On 03/01/2019 11:16 PM, Andres Freund wrote:

So, if I understand correctly you do *not* have a physical replication
slot for this standby? For the feature to work reliably that needs to
exist, and you need to have hot_standby_feedback enabled. Does having
that fix the issue?


Ok, this time around I performed it like this -

.) Master cluster (set wal_level=logical and hot_standby_feedback=on in 
postgresql.conf), start the server and create a physical replication slot


postgres=# SELECT * FROM 
pg_create_physical_replication_slot('decoding_standby');

    slot_name | lsn
--+-
 decoding_standby |
(1 row)

.) Perform pg_basebackup using --slot=decoding_standby with option -R. 
Modify port= , start the server


.) Connect to the slave and create a logical replication slot

postgres=# create table t(n int);
ERROR:  cannot execute CREATE TABLE in a read-only transaction
postgres=#

postgres=# SELECT * FROM 
pg_create_logical_replication_slot('standby_slot', 'test_decoding');

  slot_name   |    lsn
--+---
 standby_slot | 0/260
(1 row)

Run pgbench (./pgbench -i -s 10 postgres) against the master and 
simultaneously start pg_recvlogical, providing port= (slave 
cluster) and specifying slot=standby_slot:
./pg_recvlogical -d postgres  -p  -s 1 -F 1  -v --slot=standby_slot  
--start -f -



[centos@centos-cpula bin]$ ./pg_recvlogical -d postgres  -p  -s 1 -F 
1  -v --slot=standby_slot  --start -f -

pg_recvlogical: starting log streaming at 0/0 (slot standby_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot standby_slot)
pg_recvlogical: confirming write up to 0/30194E8, flush to 0/30194E8 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3019590, flush to 0/3019590 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/301D558, flush to 0/301D558 
(slot standby_slot)

BEGIN 476
COMMIT 476
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)
pg_recvlogical: confirming write up to 0/3034B40, flush to 0/3034B40 
(slot standby_slot)

BEGIN 477
COMMIT 477

If we do the same for the logical replication slot which was created on 
the master cluster:


[centos@centos-cpula bin]$ ./pg_recvlogical -d postgres  -s 1 -F 1 -v 
--slot=master_slot  --start -f -

pg_recvlogical: starting log streaming at 0/0 (slot master_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot master_slot)
table public.pgbench_accounts: INSERT: aid[integer]:65057 bid[integer]:1 
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65058 bid[integer]:1 
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65059 bid[integer]:1 
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65060 bid[integer]:1 
abalance[integer]:0 filler[character]:' '
table public.pgbench_accounts: INSERT: aid[integer]:65061 bid[integer]:1 
abalance[integer]:0 filler[character]:' '
table public

Re: Minimal logical decoding on standbys

2019-03-04 Thread tushar

On 03/04/2019 04:54 PM, tushar wrote:
.) Perform pg_basebackup using --slot=decoding_standby with option -R. 
Modify port= , start the server 


set primary_slot_name = 'decoding_standby' in the postgresql.conf file 
of the slave.


--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: amcheck verification for GiST

2019-03-04 Thread Heikki Linnakangas

On 04/03/2019 17:53, Heikki Linnakangas wrote:

I tested this patch with your testing patch from the other thread (after
fixing the above), to leave behind incompletely split pages [1]. It
seems that the amcheck code doesn't expect incomplete splits:

postgres=# SELECT gist_index_parent_check('x_c_idx');
ERROR:  index "x_c_idx" has inconsistent records


On closer look, I think that was because that testing patch to leave 
behind incomplete splits really did corrupt the index. It always 
inserted the downlink to the parent, but randomly skipped clearing the 
FOLLOW_RIGHT flag and updating the NSN in the child. That's not a valid 
combination. To test incomplete splits, you need to skip inserting the 
downlink to the parent, too.


- Heikki



Re: [HACKERS] Incomplete startup packet errors

2019-03-04 Thread Andrew Dunstan


On 3/3/19 3:52 PM, Tom Lane wrote:
> I wrote:
>> Andrew Dunstan  writes:
>>> Patch proposed by Christoph Berg is here:
>>> https://www.postgresql.org/message-id/20190228151336.GB7550%40msg.df7cb.de
>> Meh.  That doesn't silence only the zero-bytes case, and I'm also
>> rather afraid of the fact that it's changing COMMERROR to something
>> else.  I wonder whether (if client_min_messages <= DEBUG1) it could
>> result in trying to send the error message to the already-lost
>> connection.  It might be that that can't happen, but I think a fair
>> amount of rather subtle (and breakable) analysis may be needed.
> Concretely, what about doing the following instead?  This doesn't provide
> any mechanism for the DBA to adjust the logging behavior; but reducing
> log_min_messages to DEBUG1 would not be a very pleasant way to monitor for
> zero-data connections either, so I'm not that fussed about just dropping
> the message period for that case.  I kind of like that we no longer need
> the weird special case for SSLdone.
>
>   



Looks good to me.


cheers


andrew

-- 
Andrew Dunstan  https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: SQL statement PREPARE does not work in ECPG

2019-03-04 Thread Matsumura, Ryo
Hi Meskes-san


Thank you for your advice.

I attach a patch.
I didn't add additional tests to regression yet.


The patch allows the following:

  exec sql prepare st(int) as select $1;
  exec sql execute st(1) into :out;

  exec sql prepare st(text, text) as select $1 || $2;
  exec sql execute st('aaa', 'bbb') into :out;

But it doesn't allow using a host variable in the parameter clause of an 
EXECUTE statement, like the following.
I'm afraid that's not useful. I will research the standard and other RDBMSs.
If you have some information, please advise me.

  exec sql begin declare section;
  int var;
  exec sql end declare section;

  exec sql prepare st(int) as select $1;
  exec sql execute st(:var) into :out;

  SQL error: bind message supplies 1 parameters, but prepared statement "" 
requires 0



Let me explain the patch.

* PREPARE FROM or PREPARE AS without type clause
  It uses PQprepare(). It's not changed.

  [Preprocessor output]
  /* exec sql prepare st from "select ?"; */
  { ECPGprepare(__LINE__, NULL, 0, "st", "select ?");

  /* exec sql prepare st as select 1; */
  { ECPGprepare(__LINE__, NULL, 0, "st", " select 1 ");


* PREPARE AS with type clause
  It doesn't use PQprepare() but uses PQexecParams().

  [Preprocessor output]
  /* exec sql prepare st(text, text) as select $1 || '@2'; */
  { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "prepare \"st\" ( text , 
text ) as select @1 || '@2'", ECPGt_EOIT, ECPGt_EORT);

  $1 in the AS clause is replaced by the preprocessor at the ecpg_param rule.
  @1 is replaced with $1 by ecpglib at the end of ecpg_build_params().


* EXECUTE without type clause
  It uses PQexecPrepared(). It's not changed.

  [Preprocessor output]
  /* exec sql execute st into :ovar using :var; */
  { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_execute, "st",
  ECPGt_int,&(var),(long)1,.

* EXECUTE with parameter clause
  It uses PQexecParams().

  [Preprocessor output]
  /* exec sql execute st('abcde') into :ovar_s; */
  { ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_normal, "execute \"st\" ( 'abcde' 
)", ECPGt_EOIT,
  .

This approach causes the above constraint that users cannot use host variables 
in the parameter clause of an EXECUTE statement, because ecpglib sends a 'P' 
message with "execute \"st\" ($1)" and a 'B' message with one parameter, but 
the backend always regards the number of parameters of an EXECUTE statement as 
zero.
I don't have any other idea...


Regards
Ryo Matsumura


ecpg_prepare_as_v1_0.patch
Description: ecpg_prepare_as_v1_0.patch


Re: [HACKERS] Incomplete startup packet errors

2019-03-04 Thread Christoph Berg
Re: Andrew Dunstan 2019-03-04 
<7cc6d2c1-bd87-9890-259d-36739c247...@2ndquadrant.com>
> Looks good to me.

+1.

Christoph



Re: GiST VACUUM

2019-03-04 Thread Heikki Linnakangas

On 04/01/2019 02:47, Andrey Borodin wrote:

On 2 Jan 2019, at 20:35, Heikki Linnakangas wrote:

In patch #1, to do the vacuum scan in physical order:
...
I think this is ready to be committed, except that I didn't do any testing. We 
discussed the need for testing earlier. Did you write some test scripts for 
this, or how have you been testing?

Please see test I used to check left jumps for v18:
0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
0002-Test-left-jumps-v18.patch

To trigger FollowRight, GiST sometimes forgets to clear the follow-right marker, simulating a crash of an 
insert. This fills the logs with "fixing incomplete split" messages. Search for "REMOVE 
THIS" to disable these ill-behavior triggers.
To trigger the NSN jump, GiST allocates an empty page after every real allocation.

In log output I see
2019-01-03 22:27:30.028 +05 [54596] WARNING:  RESCAN TRIGGERED BY NSN
WARNING:  RESCAN TRIGGERED BY NSN
2019-01-03 22:27:30.104 +05 [54596] WARNING:  RESCAN TRIGGERED BY FollowRight
This means that code paths were really executed (for v18).


Thanks! As I noted at 
https://www.postgresql.org/message-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e%40iki.fi, 
the test patch left the index corrupt. I fixed it so that it leaves 
behind incompletely split pages, without the corruption, see attached. 
It's similar to yours, but let me recap what it does:


* Hacks gistbuild() to create 100 empty pages immediately after the root 
page. They are leaked, so they won't be reused until a VACUUM sees 
them and puts them in the FSM


* Hacks gistinserttuples() to leave the split incomplete with 50% 
probability


* In vacuum, print a line to the log whenever it needs to "jump left"

I used this patch, with the attached test script that's similar to 
yours, but it also tries to verify that the index returns correct 
results. It prints a result set like this:


   sum
-
 -364450
  364450
(2 rows)

If the two numbers add up to 0, the index seems valid. And you should 
see "RESCAN" lines in the log, to show that jumping left happened. 
Because the behavior is random and racy, you may need to run the script 
many times to see both "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED 
BY FollowRight" cases. Especially since the "FollowRight" case happens less 
frequently than the NSN case, you might need to run the script > 10 
times to see it.


I also tried your amcheck tool with this. It did not report any errors.

Attached is also the latest version of the patch itself. It is the same as 
your latest patch v19, except for some tiny comment kibitzing. I'll mark 
this as Ready for Committer in the commitfest app, and will try to 
commit it in the next couple of days.


- Heikki


gist-vacuum-test.sh
Description: application/shellscript
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 3f52b8f4dc..cad4a2a46e 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1187,6 +1187,8 @@ gistinserttuple(GISTInsertState *state, GISTInsertStack *stack,
 			InvalidBuffer, InvalidBuffer, false, false);
 }
 
+static bool HACK = false;
+
 /* 
  * An extended workhorse version of gistinserttuple(). This version allows
  * inserting multiple tuples, or replacing a single tuple with multiple tuples.
@@ -1230,6 +1232,14 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 	CheckForSerializableConflictIn(state->r, NULL, stack->buffer);
 
 	/* Insert the tuple(s) to the page, splitting the page if necessary */
+	if (BufferIsValid(leftchild) && HACK)
+	{
+		/* skip actually inserting */
+		splitinfo = NULL;
+		is_split = false;
+	}
+	else
+	{
 	is_split = gistplacetopage(state->r, state->freespace, giststate,
 			   stack->buffer,
 			   tuples, ntup,
@@ -1238,6 +1248,7 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 			   &splitinfo,
 			   true,
 			   state->heapRel);
+	}
 
 	/*
 	 * Before recursing up in case the page was split, release locks on the
@@ -1256,7 +1267,12 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 	 * didn't have to split, release it ourselves.
 	 */
 	if (splitinfo)
+	{
+		if (random() % 2 == 0)
+			HACK = true;
 		gistfinishsplit(state, stack, giststate, splitinfo, unlockbuf);
+		HACK = false;
+	}
 	else if (unlockbuf)
 		LockBuffer(stack->buffer, GIST_UNLOCK);
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index bd142a3560..fdfc54b009 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -201,6 +201,9 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	for (int i = 0; i < 100; i++)
+		ReleaseBuffer(ReadBuffer(index, P_NEW));
+
 	/*
 	 * Do the heap scan.
 	 */
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 1cbcb038f7..c306621e1e 100644
--- a/src/backend/a

Re: Online verification of checksums

2019-03-04 Thread Tomas Vondra



On 3/4/19 4:09 AM, Michael Paquier wrote:
> On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
>> I agree that having a server function (extension?) to do a full checksum
>> verification, possibly bandwidth-controlled, would be a good thing. However
>> it would have side effects, such as interfering deeply with the server page
>> cache, which may or may not be desirable.
> 
> In what is that different from VACUUM or a sequential scan?  It is
> possible to use buffer ring replacement strategies in such cases using
> the normal clock-sweep algorithm, so that scanning a range of pages
> does not really impact Postgres shared buffer cache.
> --

But Fabien was talking about the page cache, not shared buffers. And we
can't use a custom ring buffer there. OTOH I don't see why accessing the
file through a SQL function would behave any differently from direct
access (i.e. what the tool does now).

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: \describe*

2019-03-04 Thread Ibrar Ahmed
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Thanks for the patch. I have reviewed it and have some comments. The review 
includes testing of the patch along with some code review. 

Here are my testing results: 

- Tab completion for \describe-verbose.
I know that \d+ tab completion is also not there, but I think we must have tab 
completion for \describe-verbose.

postgres=# \describe-
\describe-extension   \describe-replication-publication 
\describe-user-mapping
\describe-foreign-data-wrapper\describe-replication-subscription
\describe-view
\describe-foreign-server  \describe-role
\describe-window-function
\describe-foreign-table   \describe-rule  
 ...


- Error message in each command.
There is an error message after each command; here is an example. 
postgres=# \describe
List of relations
 Schema | Name | Type  |  Owner  
+--+---+-
 public | foo  | table | vagrant

(1 row)
Invalid command \describe. Try \? for help.


I think this status is causing the problem.



+   /* standard listing of interesting things */
+   success = listTables("tvmsE", NULL, show_verbose, show_system);
+   }
+   status = PSQL_CMD_UNKNOWN;




- Confusion about \desc and \desC
There is confusion while running the \desc command. I know the problem, but the 
user may be confused by this.
postgres=# \desC
   List of foreign servers
 Name | Owner | Foreign-data wrapper 
--+---+--
(0 rows)

postgres=# \desc
Invalid command \desc. Try \? for help.

- Auto-completion of commands.
There is some more confusion in the completion of commands. 

This command shows List of aggregates.
postgres=# \describe-aggregate-function 
 List of aggregate functions
 Schema | Name | Result data type | Argument data types | Description 
+--+--+-+-
(0 rows)



This command shows a list of relations ("\d").
postgres=# \describe-aggregatE-function 
List of relations
 Schema | Name | Type  |  Owner  
+--+---+-
 public | foo  | table | vagrant
(1 row)

This command also shows a list of relations ("\d").
postgres=# \describe-aggr 
List of relations
 Schema | Name | Type  |  Owner  
+--+---+-
 public | foo  | table | vagrant
(1 row)

This command shows error messages.
postgres=# \descr
Invalid command \descr. Try \? for help.

...


Code review. 
-

I have done a brief code review except for the documentation code. I don't like 
this code 

if (cmd_match(cmd, "describe-aggregate-function"))
    success = describeAggregates(pattern, show_verbose, show_system);
else if (cmd_match(cmd, "describe-access-method"))
    success = describeAccessMethods(pattern, show_verbose);
else if (cmd_match(cmd, "describe-tablespace"))
    success = describeTablespaces(pattern, show_verbose);
else if (cmd_match(cmd, "describe-conversion"))
    success = listConversions(pattern, show_verbose, show_system);
else if (cmd_match(cmd, "describe-cast"))
    success = listCasts(pattern, show_verbose


This can be achieved with a list/array/hash table, so I have changed that 
code in the attached patch just as a sample; if you want, I can do that for the 
whole code.
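
As a rough sketch of that table-driven approach (illustrative only, not taken
from the attached patch; the real describe.c handlers have differing
signatures, so each entry goes through a thin wrapper, and only two entries
are shown):

	typedef bool (*describe_handler) (const char *pattern,
									  bool show_verbose, bool show_system);

	static bool
	describe_aggregates_cmd(const char *pattern, bool show_verbose, bool show_system)
	{
		return describeAggregates(pattern, show_verbose, show_system);
	}

	static bool
	describe_access_methods_cmd(const char *pattern, bool show_verbose, bool show_system)
	{
		/* describeAccessMethods() has no "system" flag, so it is ignored */
		return describeAccessMethods(pattern, show_verbose);
	}

	static const struct
	{
		const char *name;
		describe_handler handler;
	} describe_commands[] = {
		{"describe-aggregate-function", describe_aggregates_cmd},
		{"describe-access-method", describe_access_methods_cmd},
		/* ... one entry per remaining \describe-* variant ... */
	};

	/* lookup loop replacing the if/else chain */
	for (int i = 0; i < lengthof(describe_commands); i++)
	{
		if (cmd_match(cmd, describe_commands[i].name))
		{
			success = describe_commands[i].handler(pattern, show_verbose, show_system);
			break;
		}
	}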

-- 
Ibrar Ahmed

The new status of this patch is: Waiting on Author


Re: Ordered Partitioned Table Scans

2019-03-04 Thread Antonin Houska
David Rowley  wrote:

> On Mon, 5 Nov 2018 at 10:46, David Rowley  
> wrote:
> > On 1 November 2018 at 22:05, Antonin Houska  wrote:
> > > I think these conditions are too restrictive:
> > >
> > > /*
> > >  * Determine if these pathkeys match the partition order, or 
> > > reverse
> > >  * partition order.  It can't match both, so only go to the 
> > > trouble of
> > >  * checking the reverse order when it's not in ascending partition
> > >  * order.
> > >  */
> > > partition_order = pathkeys_contained_in(pathkeys,
> > > partition_pathkeys);
> > > partition_order_desc = !partition_order &&
> > > pathkeys_contained_in(pathkeys,
> > > 
> > > partition_pathkeys_desc);
> > >
> 
> > The problem with doing that is that if the partition keys are better
> > than the pathkeys then we'll most likely fail to generate any
> > partition keys at all due to lack of any existing eclass to use for
> > the pathkeys. It's unsafe to use just the prefix in this case as the
> > eclass may not have been found due to, for example one of the
> > partition keys having a different collation than the required sort
> > order of the query. In other words, we can't rely on a failure to
> > create the pathkey meaning that a more strict sort order is not
> > required.
> 
> I had another look at this patch and it seems okay just to add a new
> flag to build_partition_pathkeys() to indicate if the pathkey List was
> truncated or not.  In generate_mergeappend_paths() we can then just
> check that flag before checking if the partiiton pathkeys are
> contained in pathkeys. It's fine if the partition keys were truncated
> for the reverse of that check.
> 
> I've done this in the attached and added additional regression tests
> for this case.

I agree with this approach and I'm also fine with your other comments / code
changes to address my review.

As for the latest version (v8-0001-...) I've only caught a small typo: "When
the first subpath needs sorted ...". It was probably meant "... needs sort
...".

-- 
Antonin Houska
https://www.cybertec-postgresql.com



Re: \describe*

2019-03-04 Thread Ibrar Ahmed
Hi Corey,

Here is the modified patch (sample).



On Mon, Mar 4, 2019 at 7:02 PM Ibrar Ahmed  wrote:

> The following review has been posted through the commitfest application:
> make installcheck-world:  not tested
> Implements feature:   not tested
> Spec compliant:   not tested
> Documentation:not tested
>
> Thanks for the patch, I have reviewed the patch and have some comments
> about the patch. The review includes the testing of the patch along with
> some code review.
>
> Here are my testings results,
>
> - Tab completion for \describe-verbose.
> I know that \d+ tab completion is also not there, but I think we must have
> tab completion for \describe-verbose.
>
> postgres=# \describe-
> \describe-extension
>  \describe-replication-publication \describe-user-mapping
> \describe-foreign-data-wrapper
> \describe-replication-subscription\describe-view
> \describe-foreign-server  \describe-role
>   \describe-window-function
> \describe-foreign-table   \describe-rule
>  ...
>
>
> - Error message in each command.
> There is an error message after each command, here is the example.
> postgres=# \describe
> List of relations
>  Schema | Name | Type  |  Owner
> +--+---+-
>  public | foo  | table | vagrant
>
> (1 row)
> Invalid command \describe. Try \? for help.
>
>
> I think this status is causing the problem.
>
>
>
> +   /*
> standard listing of interesting things */
> +   success =
> listTables("tvmsE", NULL, show_verbose, show_system);
> +   }
> +   status = PSQL_CMD_UNKNOWN;
>
>
>
>
> - Confusion about \desc and \desC
> There is confusion while running the \desc command. I know the problem,
> but the user may confuse by this.
> postgres=# \desC
>List of foreign servers
>  Name | Owner | Foreign-data wrapper
> --+---+--
> (0 rows)
>
> postgres=# \desc
> Invalid command \desc. Try \? for help.
>
> - Auto-completion of commands.
> There is some more confusion in the completion of commands.
>
> This command shows List of aggregates.
> postgres=# \describe-aggregate-function
>  List of aggregate functions
>  Schema | Name | Result data type | Argument data types | Description
> +--+--+-+-
> (0 rows)
>
>
>
> This command shows a list of relation "\d"
> postgres=# \describe-aggregatE-function
> List of relations
>  Schema | Name | Type  |  Owner
> +--+---+-
>  public | foo  | table | vagrant
> (1 row)
>
> This command also shows a list of relations "\d".
> postgres=# \describe-aggr
> List of relations
>  Schema | Name | Type  |  Owner
> +--+---+-
>  public | foo  | table | vagrant
> (1 row)
>
> This command shows error messages.
> postgres=# \descr
> Invalid command \descr. Try \? for help.
>
> ...
>
>
> Code review.
> -
>
> I have done a brief code review except for the documentation code. I don't
> like this code
>
> if (cmd_match(cmd,"describe-aggregate-function"))
>
>  success = describeAggregates(pattern, show_verbose, show_system);
>  else if (cmd_match(cmd,
> "describe-access-method"))
>  success = describeAccessMethods(pattern,
> show_verbose);
>  else if (cmd_match(cmd,
> "describe-tablespace"))
>  success = describeTablespaces(pattern,
> show_verbose);
>  else if (cmd_match(cmd,
> "describe-conversion"))
>  success = listConversions(pattern,
> show_verbose, show_system);
>  else if (cmd_match(cmd, "describe-cast"))
>  success = listCasts(pattern, show_verbose
>
>
> This can be achieved with the list/array/hash table, so I have changed
> that code in the attached patch just for a sample if you want I can do that
> for whole code.
>
> --
> Ibrar Ahmed
>
> The new status of this patch is: Waiting on Author
>


-- 
Ibrar Ahmed


0001-Add-describe-commands-to-compliment-d-commands-ibrar-v2.patch
Description: Binary data


Re: Online verification of checksums

2019-03-04 Thread Tomas Vondra




On 3/4/19 2:00 AM, Michael Paquier wrote:
> On Sun, Mar 03, 2019 at 03:12:51AM +0100, Tomas Vondra wrote:
>> You and Andres may be right that trying to verify checksums online
>> without close interaction with the server is ultimately futile (or at
>> least overly complex). But I'm not sure those issues (torn pages and
>> partial reads) are very good arguments, considering basebackup has to
>> deal with them too. Not sure.
> 
> FWIW, I don't think that the backend is right in its way of checking
> checksums the way it does currently either with warnings and a limited
> set of failures generated.  I raised concerns about that unfortunately
> after 11 has been GA'ed, which was too late, so this time, for this
> patch, I prefer raising them before the fact and I'd rather not spread
> this kind of methodology around the core code more and more.

I still don't understand what issue you see in how basebackup verifies
checksums. Can you point me to the explanation you've sent after 11 was
released?

> I work a lot with virtualization, and I have seen ESX hanging around
> I/O requests from time to time depending on the environment used
> (which is actually wrong, anyway, but a lot of tests happen on a
> daily basis on the stuff I work on).  What's presented on this thread
> is *never* going to be 100% safe, and would generate false positives
> which can be confusing for the user.  This is not a good sign.

So you have a workload/configuration that actually results in data
corruption yet we fail to detect that? Or we generate false positives?
Or what do you mean by "100% safe" here?


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Re: proposal: variadic argument support for least, greatest function

2019-03-04 Thread David Steele

On 3/1/19 3:59 AM, Chapman Flack wrote:

The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:tested, passed

For completeness, I'll mark this reviewed again. It passes installcheck-world, 
the latest patch addresses the suggestions from me, and is improved on the 
code-structure matters that Tom pointed out. I don't know if it will meet Tom's 
threshold for desirability overall, but that sounds like a committer call at 
this point, so I'll change it to RfC.


Both committers who have looked at this patch (Tom, and Andres in his 
patch roundup [1]) recommend that it be rejected.


If no committer steps up in the next week I think we should mark it as 
rejected.


Regards,
--
-David
da...@pgmasters.net

[1] 
https://www.postgresql.org/message-id/20190214203752.t4hl574k6jlu4t25%40alap3.anarazel.de




Re: insensitive collations

2019-03-04 Thread Daniel Verite
Peter Eisentraut wrote:

[v7-0001-Collations-with-nondeterministic-comparison.patch]

+GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation)
 {
+	if (collation && !lc_ctype_is_c(collation) && collation != DEFAULT_COLLATION_OID)
+	{
+		pg_locale_t	locale = pg_newlocale_from_collation(collation);
+
+		if (locale && !locale->deterministic)
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("nondeterministic collations are not supported for LIKE")));
+	}

This test gets side-stepped when pattern_fixed_prefix() in selfuncs.c
returns Pattern_Prefix_Exact, and the code optimizes the operation by
converting it to a bytewise equality test, or a bytewise range check
in the index with Pattern_Type_Prefix.

Here's a reproducer:

===
create collation ciai (locale='und-u-ks-level1', deterministic=false,
provider='icu');

create table w(t text collate "C");

insert into w select md5(i::text) from generate_series(1,1) as i;
insert into w values('abc');

create index indexname on w(t );

select t from w where t like 'ABC' collate ciai;
 t 
---
(0 rows)

select t from w where t like 'ABC%' collate ciai;
 t 
---
(0 rows)

===

For the LIKE operator, I think the fix should be that like_fixed_prefix() 
should always return Pattern_Prefix_None for non-deterministic collations.
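
A sketch of that guard near the top of like_fixed_prefix() in selfuncs.c
(illustrative only; it assumes the patch exposes a per-collation determinism
lookup such as get_collation_isdeterministic(), and output-parameter handling
would follow the function's existing no-prefix path):

	/*
	 * Sketch of the suggested guard: a nondeterministic collation can treat
	 * strings as equal that are not bytewise equal, so no fixed prefix can
	 * safely be extracted from the pattern.
	 */
	if (OidIsValid(collation) && !get_collation_isdeterministic(collation))
		return Pattern_Prefix_None;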

For regular expressions, pg_set_regex_collation() is called at some
point when checking for a potential prefix, and since it errors out with
non-deterministic collations, this issue is taken care of already.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: pg_dump multi VALUES INSERT

2019-03-04 Thread Alvaro Herrera
On 2019-Mar-02, Fabien COELHO wrote:

> Although I'm all in favor of checking the int associated to the option, I do
> not think that it warrants three checks and messages. I would suggest to
> factor them all as just one check and one (terse) message.

I suggest ("rows-per-insert must be in range 1..%d", INT_MAX), like
extra_float_digits and compression level.
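
For illustration, a single check along those lines in pg_dump's option
handling might look like this (a sketch only; the variable name and the
surrounding error-reporting style are assumptions, not taken from the patch):

	/*
	 * Sketch only: one range check for --rows-per-insert, with an
	 * illustrative variable name; error reporting follows pg_dump's
	 * existing write_msg()/exit_nicely() style.
	 */
	char	   *endptr;
	long		rowsPerInsert;

	errno = 0;
	rowsPerInsert = strtol(optarg, &endptr, 10);
	if (endptr == optarg || *endptr != '\0' || errno == ERANGE ||
		rowsPerInsert < 1 || rowsPerInsert > INT_MAX)
	{
		write_msg(NULL, "rows-per-insert must be in range 1..%d\n", INT_MAX);
		exit_nicely(1);
	}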

> About the output: I'd suggest to indent one line per row, something like:
> 
>   INSERT INTO foo VALUES
> (..., ..., ..., ...),
> (..., ..., ..., ...),
> (..., ..., ..., ...);

+1.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: query logging of prepared statements

2019-03-04 Thread Arthur Zakirov

Hello Justin,

On 27.02.2019 21:06, Justin Pryzby wrote:

I'm attaching a v2 patch which avoids repeated logging of PREPARE, rather than
making such logs conditional on log_error_verbosity=VERBOSE, since
log_error_verbosity is documented to control whether these are output:
DETAIL/HINT/QUERY/CONTEXT/SQLSTATE.

I looked the patch. I got interesting result with different parameters.

With log_statement='all' and log_min_duration_statement='0' I get:

=# execute test_ins(3);
LOG:  statement: execute test_ins(3);
LOG:  duration: 17.283 ms

But with log_statement='none' and log_min_duration_statement='0' I get:

=# execute test_ins(3);
LOG:  duration: 8.439 ms  statement: execute test_ins(3);
DETAIL:  prepare: prepare test_ins (int) as
insert into test values ($1);

Is it intended? In the second result I got the query details.

--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: Online verification of checksums

2019-03-04 Thread Magnus Hagander
On Mon, Mar 4, 2019 at 3:02 PM Tomas Vondra 
wrote:

>
>
> On 3/4/19 4:09 AM, Michael Paquier wrote:
> > On Sun, Mar 03, 2019 at 07:58:26AM +0100, Fabien COELHO wrote:
> >> I agree that having a server function (extension?) to do a full checksum
> >> verification, possibly bandwidth-controlled, would be a good thing.
> However
> >> it would have side effects, such as interfering deeply with the server
> page
> >> cache, which may or may not be desirable.
> >
> > In what is that different from VACUUM or a sequential scan?  It is
> > possible to use buffer ring replacement strategies in such cases using
> > the normal clock-sweep algorithm, so that scanning a range of pages
> > does not really impact Postgres shared buffer cache.
> > --
>
> But Fabien was talking about page cache, not shared buffers. And we
> can't use custom ring buffer there. OTOH I don't see why accessing the
> file through SQL function would behave any differently than direct
> access (i.e. what the tool does now).
>

It shouldn't.

One other thought that I had around this though, which if it's been covered
before and I missed it, please disregard :)

The *online* version of the tool is very similar to running pg_basebackup
to /dev/null, is it not? Except it doesn't set the cluster to backup mode.
Perhaps what we really want is a simpler way to do *that*. That wouldn't
necessarily make it a SQL callable function, but it would be a CLI tool
that would call a command on a walsender for example.

(We'd of course still need the standalone tool for offline checks)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-03-04 Thread Rahila Syed
Hi Alvaro,


> On 2019-Feb-13, Amit Langote wrote:
>
> > Doesn't the name amphasename sound a bit too generic, given that it can
> > only describe the phases of ambuild?  Maybe ambuildphase?
>
> Hmm, yeah, maybe it does.  I renamed it "ambuildphasename", since it's
> not about reporting the phase itself -- it's about translating the phase
> number to the string that's reported to the user.
>
> The attached patch does it that way.  Also, when an index build uses an
> AM that doesn't support progress reporting, it no longer reports a NULL
> phase name while building.  I also changed it to report the progress of
> phase 7 (heap scan validation) using block numbers rather than tuple
> counts.  I also tweaked the strings reported in the view.  They're
> clearer now IMO.
>
> One slight annoyance is that when parallel workers are used, the last
> block number reported in phase 3/subphase 2 (IndexBuildHeapScan stuff)
> is not necessarily accurate, since the tail of the table could well be
> scanned by a worker that's not the leader, and we only report in the
> leader when it gets a new block.
>
> When the AM does not support progress reporting, everything stays as
> zeros during the index build phase.  It's easy to notice how slow hash
> indexes are to build compared to btrees this way!  Maybe it'd be
> better fallback on reporting block numbers in IndexBuildHeapScan when
> this happens.  Thoughts?
>
> I added docs to the monitoring section -- that's the bulkiest part of
> the patch.
>

1. Thank you for incorporating review comments.
Can you please rebase the latest
0001-Report-progress-of-CREATE-INDEX-operations.patch on master? Currently
it does not apply on 754b90f657bd54b482524b73726dae4a9165031c


>  15:56:44.694 | building index (3 of 8): initializing (1/5)            | 442478 | 442399 | 0 |     0 | 0 | 0
>  15:56:44.705 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 442478 | 442399 | 1 |     0 | 0 | 0
>  15:56:44.716 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 442478 | 442399 | 1 |     0 | 0 | 0
>  15:56:44.727 | building index (3 of 8): final btree sort & load (5/5) | 442478 | 442399 | 1 | 79057 | 0 | 0
>

2. In the above report, even though we are reporting progress in terms of
tuples_done for final btree sort & load phase we have not cleared
the blocks_done entry from previous phases. I think this can be confusing
as the blocks_done does not correspond to the tuples_done in the current
phase.
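
For illustration, clearing the stale counters at the phase switch could look
roughly like this (a sketch only; the PROGRESS_CREATEIDX_* index names are
placeholders, not the patch's actual constants):

	/*
	 * Sketch: zero the block counters carried over from the heap-scan
	 * subphase when switching to the "final btree sort & load" phase.
	 * The PROGRESS_CREATEIDX_* names are illustrative placeholders.
	 */
	const int	prog_index[] = {PROGRESS_CREATEIDX_BLOCKS_TOTAL,
								PROGRESS_CREATEIDX_BLOCKS_DONE};
	const int64 prog_val[] = {0, 0};

	pgstat_progress_update_multi_param(2, prog_index, prog_val);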


-- 
Rahila Syed
Performance Engineer
2ndQuadrant
http://www.2ndQuadrant.com 
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: monitoring CREATE INDEX [CONCURRENTLY]

2019-03-04 Thread Rahila

Hi Alvaro,

Resending the email as earlier one didn't get sent on pgsql-hackers.

On 2/23/19 3:24 AM, Alvaro Herrera wrote:

On 2019-Feb-13, Amit Langote wrote:


Doesn't the name amphasename sound a bit too generic, given that it can
only describe the phases of ambuild?  Maybe ambuildphase?

Hmm, yeah, maybe it does.  I renamed it "ambuildphasename", since it's
not about reporting the phase itself -- it's about translating the phase
number to the string that's reported to the user.

The attached patch does it that way.  Also, when an index build uses an
AM that doesn't support progress reporting, it no longer reports a NULL
phase name while building.  I also changed it to report the progress of
phase 7 (heap scan validation) using block numbers rather than tuple
counts.  I also tweaked the strings reported in the view.  They're
clearer now IMO.

One slight annoyance is that when parallel workers are used, the last
block number reported in phase 3/subphase 2 (IndexBuildHeapScan stuff)
is not necessarily accurate, since the tail of the table could well be
scanned by a worker that's not the leader, and we only report in the
leader when it gets a new block.

When the AM does not support progress reporting, everything stays as
zeros during the index build phase.  It's easy to notice how slow hash
indexes are to build compared to btrees this way!  Maybe it'd be
better fallback on reporting block numbers in IndexBuildHeapScan when
this happens.  Thoughts?

I added docs to the monitoring section -- that's the bulkiest part of
the patch.


1. Thank you for incorporating review comments.
Can you please rebase the latest 0001-Report-progress-of-
CREATE-INDEX-operations.patch on master? Currently it does not apply on 
754b90f657bd54b482524b73726dae4a9165031c

  15:56:44.694 | building index (3 of 8): initializing (1/5)            | 442478 | 442399 | 0 |      0 | 0 | 0
  15:56:44.705 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 442478 | 442399 | 1 |      0 | 0 | 0
  15:56:44.716 | building index (3 of 8): sorting tuples, spool 1 (3/5) | 442478 | 442399 | 1 |      0 | 0 | 0
  15:56:44.727 | building index (3 of 8): final btree sort & load (5/5) | 442478 | 442399 | 1 |  79057 | 0 | 0
  15:56:44.738 | building index (3 of 8): final btree sort & load (5/5) | 442478 | 442399 | 1 | 217018 | 0 | 0
  15:56:44.75  | building index (3 of 8): final btree sort & load (5/5) | 442478 | 442399 | 1 | 353804 | 0 | 0
2. In the above report, even though we are reporting progress in terms 
of tuples_done for final btree sort & load phase we have not cleared
the blocks_done entry from previous phases. I think this can be 
confusing as the blocks_done does not correspond to the tuples_done in 
the final btree sort & load phase.


Thank you,
Rahila Syed




Re: speeding up planning with partitions

2019-03-04 Thread Jesper Pedersen

Hi Amit,

Passes check-world.

On 3/4/19 5:38 AM, Amit Langote wrote:

See patch 0001.



+* members of any appendrels we find here are added built later when

s/built//


See patch 0002.



+   grouping_planner(root, false, 0.0 /* retrieve all tuples */);

Move comment out of function call.

+   if (root->simple_rte_array[childRTindex])
+   elog(ERROR, "rel %d already exists", childRTindex);
+   root->simple_rte_array[childRTindex] = childrte;
+   if (root->append_rel_array[childRTindex])
+   elog(ERROR, "child relation %d already exists", 
childRTindex);
+   root->append_rel_array[childRTindex] = appinfo;


Could the "if"s be made into Assert's instead ?
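
For illustration, the suggested form would be roughly (sketch only, mirroring
the lines quoted above):

	/* sketch of the suggested Assert form, replacing the elog() checks */
	Assert(root->simple_rte_array[childRTindex] == NULL);
	root->simple_rte_array[childRTindex] = childrte;
	Assert(root->append_rel_array[childRTindex] == NULL);
	root->append_rel_array[childRTindex] = appinfo;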

+ * the newly added bytes with zero

Extra spaces

+   if (rte->rtekind == RTE_RELATION &&  !root->contains_inherit_children)

s/TAB/space


See patch 0003.



+* because they correspond to flattneed UNION ALL subqueries.  
Especially,

s/flattneed/flatten


See patch 0004.



+* no need to make any new entries, because anything that'd need those

Use "would" explicit

+ * this case, since it needn't be scanned.

, since it doesn't need to be scanned


See patch 0005.

See patch 0006.



I'll run some tests using a hash partitioned setup.

Best regards,
 Jesper



Re: psql show URL with help

2019-03-04 Thread Magnus Hagander
On Sun, Mar 3, 2019 at 10:48 PM David Fetter  wrote:

> On Sun, Mar 03, 2019 at 09:57:25PM +0100, Magnus Hagander wrote:
> > On Sun, Mar 3, 2019 at 7:14 PM David Fetter  wrote:
> >
> > > On Wed, Feb 27, 2019 at 09:14:59AM +0100, Peter Eisentraut wrote:
> > > > + url = psprintf("
> > > https://www.postgresql.org/docs/%s/%s.html";,
> > > > +
> > > strstr(PG_VERSION, "devel") ? "devel" : PG_MAJORVERSION,
> > > > +
> > > QL_HELP[i].docbook_id);
> > >
> > > Do we need to make sure that the docs are published under the major
> > > version as soon as we get to alpha, or do we need something more like
> > > this?
> > >
> > > url = psprintf("https://www.postgresql.org/docs/%s/%s.html";,
> > > (strstr(PG_VERSION, "devel") || strstr(PG_VERSION, "beta")
> ||
> > >  strstr(PG_VERSION, "alpha")) : "devel" : PG_MAJORVERSION,
> > > QL_HELP[i].docbook_id);
> > >
> >
> > We don't really release alphas any more. And we do load the documentation
> > alongside the betas. (Last time we did an alpha was so long ago I don't
> > remember if we loaded docs)
>
> If the first thing we do when we move from devel to some other state
> (beta, RC, etc.) is to publish the docs under the major version
> number, then maybe this test should be more along the lines of looking
> for anything that's neither devel nor a number, extract the number,
> and use that.
>

Well, alpha versions do go under the numeric URL. Whether we load the docs
at that time or not we can just choose -- but there is no reason not to. So
yeah, that sounds like it would work better.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Minimal logical decoding on standbys

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 16:54:32 +0530, tushar wrote:
> On 03/01/2019 11:16 PM, Andres Freund wrote:
> > So, if I understand correctly you do*not*  have a phyiscal replication
> > slot for this standby? For the feature to work reliably that needs to
> > exist, and you need to have hot_standby_feedback enabled. Does having
> > that fix the issue?
> 
> Ok, This time around  - I performed like this -
> 
> .)Master cluster (set wal_level=logical and hot_standby_feedback=on in
> postgresql.conf) , start the server and create a physical replication slot

Note that hot_standby_feedback=on needs to be set on a standby, not on
the primary (although it doesn't do any harm there).

Thanks,

Andres



Re: Problem with default partition pruning

2019-03-04 Thread Ibrar Ahmed
Hi 

The patch works fine for me, but I have one test case where the default partition is still 
scanned. 

postgres=# explain select * from test1 where (id < 10) and true;
QUERY PLAN 
---
 Append  (cost=0.00..55.98 rows=846 width=36)
   ->  Seq Scan on test1_1  (cost=0.00..25.88 rows=423 width=36)
 Filter: (id < 10)
   ->  Seq Scan on test1_def  (cost=0.00..25.88 rows=423 width=36)
 Filter: (id < 10)
(5 rows)

Re: Problem with default partition pruning

2019-03-04 Thread Ibrar Ahmed
Hi  Yuzuko Hosoya,

Ignore my last message; I think this is also a legitimate scan on the default
partition.


On Mon, Mar 4, 2019 at 10:29 PM Ibrar Ahmed  wrote:

> Hi
>
> Patch work fine to me, but I have one test case where default partition
> still scanned.
>
> postgres=# explain select * from test1 where (id < 10) and true;
> QUERY PLAN
> ---
>  Append  (cost=0.00..55.98 rows=846 width=36)
>->  Seq Scan on test1_1  (cost=0.00..25.88 rows=423 width=36)
>  Filter: (id < 10)
>->  Seq Scan on test1_def  (cost=0.00..25.88 rows=423 width=36)
>  Filter: (id < 10)
> (5 rows)



-- 
Ibrar Ahmed


Re: POC: converting Lists into arrays

2019-03-04 Thread Robert Haas
On Sun, Mar 3, 2019 at 1:29 PM Tom Lane  wrote:
> > My main observation was from when the expression evaluation was using
> > lists all over. List iteration overhead was very substantial there. But
> > that's not a problem anymore, because all of those are gone now due to
> > the expression rewrite.  I personally wasn't actually advocating for a
> > new list implementation, I was/am advocating that we should move some
> > tasks over to a more optimized representation.
>
> I doubt that you'll get far with that; if this experiment is anything
> to go by, it's going to be really hard to make the case that twiddling
> the representation of widely-known data structures is worth the work
> and bug hazards.

I'm befuddled by this comment.  Andres is arguing that we shouldn't go
do a blind search-and-replace, but rather change certain things, and
you're saying that's going to be really hard because twiddling the
representation of widely-known data structures is really hard.  But if
we only change certain things, we don't *need* to twiddle the
representation of a widely-known data structure.  We just add a new
one and convert the things that benefit from it, like I proposed
upthread (and promptly got told I was wrong).

I think the reason why you're not seeing a performance benefit is
because the problem is not that lists are generically a more expensive
data structure than arrays, but that there are cases when they are
more expensive than arrays.  If you only ever push/pop at the front,
of course a list is going to be better.  If you often look up elements
by index, of course an array is going to be better.  If you change
every case where the code currently uses a list to use something else
instead, then you're changing both the winning and losing cases.
Yeah, changing things individually is more work, but that's how you
get the wins without incurring the losses.

I think David's results go in this direction, too.  Code that was
written on the assumption that list_nth() is slow is going to avoid
using it as much as possible, and therefore no benefit is to be
expected from making it fast.  If the author had written the same code
assuming that the underlying data structure was an array rather than a
list, they might have picked a different algorithm which, as David's
results show, could be a lot faster in some cases.  But it's not
going to come from just changing the way lists work internally; it's
going to come from redesigning the algorithms that are using lists to
do something better instead, as Andres's example of linearized
expression evaluation also shows.
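
As a tiny illustration of that point (not from the thread; assumes an integer
List called lst):

	/*
	 * With the current cons-cell List, list_nth_int() walks i cells on
	 * every call, so this loop is O(n^2) overall; with an array-backed
	 * List the same loop would be O(n).
	 */
	int			total = 0;

	for (int i = 0; i < list_length(lst); i++)
		total += list_nth_int(lst, i);

	/* Code written with the cons-cell cost in mind iterates cells instead: */
	ListCell   *lc;

	foreach(lc, lst)
		total += lfirst_int(lc);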

> The cases I've been looking at suggest to me that we'd make far
> more progress on the excessive-palloc'ing front if we could redesign
> things to reduce unnecessary copying of parsetrees.  Right now the
> planner does an awful lot of copying because of fear of unwanted
> modifications of multiply-linked subtrees.  I suspect that we could
> reduce that overhead with some consistently enforced rules about
> not scribbling on input data structures; but it'd take a lot of work
> to get there, and I'm afraid it'd be easily broken :-(

I think that's a separate but also promising thing to attack, and I
agree that it'd take a lot of work to get there.  I don't think that
the problem with either parse-tree-copying or list usage is that no
performance benefits are to be had; I think it's that the amount of
work required to get those benefits is pretty large.  If it were
otherwise, somebody likely would have done it before now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-04 Thread Laurenz Albe
Masahiko Sawada wrote:
> Why do people want to just encrypt everything? For satisfying some
> security compliance?

I'd say that TDE primarily protects you from masked ninjas that
break into your server room and rip out the disks with your database
on them.

Or from people stealing your file system backups that you leave
lying around in public.

My guess is that this requirement almost always comes from security
departments that don't know a lot about the typical security threats
that databases face, or (worse) from lawmakers.

And these are probably the people who will insist that *everything*
is encrypted, even your commit log (unencrypted log? everyone can
read the commits?).

Yours,
Laurenz Albe




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-04 Thread Chris Howard



Or on your laptop



On 3/4/19 11:55 AM, Laurenz Albe wrote:

Masahiko Sawada wrote:

Why do people want to just encrypt everything? For satisfying some
security compliance?

I'd say that TDE primarily protects you from masked ninjas that
break into your server room and rip out the disks with your database
on them.

Or from people stealing your file system backups that you leave
lying around in public.

My guess is that this requirement almost always comes from security
departments that don't know a lot about the typical security threats
that databases face, or (worse) from lawmakers.

And these are probably the people who will insist that *everything*
is encrypted, even your commit log (unencrypted log? everyone can
read the commits?).

Yours,
Laurenz Albe









Re: POC: converting Lists into arrays

2019-03-04 Thread Tom Lane
Robert Haas  writes:
> I think the reason why you're not seeing a performance benefit is
> because the problem is not that lists are generically a more expensive
> data structure than arrays, but that there are cases when they are
> more expensive than arrays.  If you only ever push/pop at the front,
> of course a list is going to be better.  If you often look up elements
> by index, of course an array is going to be better.  If you change
> every case where the code currently uses a list to use something else
> instead, then you're changing both the winning and losing cases.

I don't think this argument is especially on-point, because what I'm
actually seeing is just that there aren't any list operations that
are expensive enough to make much of an overall difference in
typical queries.  To the extent that an array reimplementation
reduces the palloc traffic, it'd take some load off that subsystem,
but apparently you need not-typical queries to really notice.
(And, if the real motivation is aggregate palloc savings, then yes you
really do want to replace everything...)

> Yeah, changing things individually is more work, but that's how you
> get the wins without incurring the losses.

The concern I have is mostly about the use of lists as core infrastructure
in parsetree, plantree, etc data structures.  I think any idea that we'd
replace those piecemeal is borderline insane: it's simply not worth it
from a notational and bug-risk standpoint to glue together some parts of
those structures differently from the way other parts are glued together.

regards, tom lane



Re: Segfault when restoring -Fd dump on current HEAD

2019-03-04 Thread Andres Freund
Hi,

On 2019-02-27 09:32:17 -0300, Alvaro Herrera wrote:
> On 2019-Feb-27, Dmitry Dolgov wrote:
> > But I hope there are no objections if I'll then submit the original
> > changes with more consistent null handling separately to make decision
> > about them more consciously.
> 
> I think we should save such a patch for whenever we next update the
> archive version number, which could take a couple of years given past
> history.  I'm inclined to add a comment near K_VERS_SELF to remind
> whoever next patches it.

The pluggable storage patchset contains exactly that... I've attached
the precursor patch (CREATE ACCESS METHOD ... TYPE TABLE), and the patch
for pg_dump support. They need a bit more cleanup, but it might be
useful information for this thread.

One thing I want to bring up here rather than in the pluggable storage
thread is that currently the pg_dump support for access methods deals
with table access methods in a manner similar to the way we deal with
tablespaces. Instead of specifying the AM on every table creation, we
set the default AM when needed.  That makes it easier to adjust dumps.

But it does basically require breaking archive compatibility.  I
personally am OK with that, but I thought it might be worth discussing.

I guess we could try to avoid the compat issue by only increasing the
archive format if there actually are any non-default AMs, but to me that
doesn't seem like an improvement worthy of the necessary complications.

Greetings,

Andres Freund
>From 2f37df9c3da0eb9cd75709a35d898269b0da2a0c Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Thu, 17 Jan 2019 14:11:35 -0800
Subject: [PATCH v14 1/2] tableam: introduce + minimal infrastructure.

Author: Haribabu Kommi, Andres Freund, Alvaro Herrera, Dimitri Golgov and others
Discussion: https://postgr.es/m/
---
 src/backend/access/heap/Makefile  |   2 +-
 src/backend/access/heap/heapam_handler.c  |  45 ++
 src/backend/access/table/Makefile |   2 +-
 src/backend/access/table/tableam.c|  18 +++
 src/backend/access/table/tableamapi.c | 172 ++
 src/backend/bootstrap/bootparse.y |   2 +
 src/backend/catalog/genbki.pl |   4 +
 src/backend/catalog/heap.c|  22 +++
 src/backend/catalog/index.c   |   1 +
 src/backend/catalog/toasting.c|   1 +
 src/backend/commands/amcmds.c |  28 ++--
 src/backend/commands/cluster.c|   1 +
 src/backend/commands/createas.c   |   1 +
 src/backend/commands/tablecmds.c  |  33 +
 src/backend/nodes/copyfuncs.c |   1 +
 src/backend/parser/gram.y | 100 -
 src/backend/rewrite/rewriteDefine.c   |   1 +
 src/backend/utils/adt/pseudotypes.c   |   1 +
 src/backend/utils/cache/relcache.c| 125 +++-
 src/backend/utils/misc/guc.c  |  12 ++
 src/bin/psql/describe.c   |  25 +++-
 src/bin/psql/settings.h   |   1 +
 src/bin/psql/startup.c|   8 +
 src/include/access/tableam.h  |  48 ++
 src/include/catalog/heap.h|   2 +
 src/include/catalog/pg_am.dat |   3 +
 src/include/catalog/pg_am.h   |   1 +
 src/include/catalog/pg_class.dat  |   8 +-
 src/include/catalog/pg_class.h|   2 +-
 src/include/catalog/pg_proc.dat   |  13 ++
 src/include/catalog/pg_type.dat   |   5 +
 src/include/nodes/nodes.h |   1 +
 src/include/nodes/parsenodes.h|   1 +
 src/include/nodes/primnodes.h |   1 +
 src/include/utils/rel.h   |  15 +-
 src/include/utils/relcache.h  |   3 +
 src/test/regress/expected/create_am.out   |  79 ++
 src/test/regress/expected/opr_sanity.out  |  19 ++-
 src/test/regress/expected/psql.out|  40 +
 src/test/regress/expected/type_sanity.out |   6 +-
 src/test/regress/pg_regress_main.c|   7 +-
 src/test/regress/sql/create_am.sql|  47 ++
 src/test/regress/sql/opr_sanity.sql   |  16 +-
 src/test/regress/sql/psql.sql |  15 ++
 src/test/regress/sql/type_sanity.sql  |   7 +-
 src/tools/pgindent/typedefs.list  |   1 +
 46 files changed, 872 insertions(+), 74 deletions(-)
 create mode 100644 src/backend/access/heap/heapam_handler.c
 create mode 100644 src/backend/access/table/tableam.c
 create mode 100644 src/backend/access/table/tableamapi.c
 create mode 100644 src/include/access/tableam.h

diff --git a/src/backend/access/heap/Makefile b/src/backend/access/heap/Makefile
index eae36fdbf40..b2a017249b8 100644
--- a/src/backend/access/heap/Makefile
+++ b/src/backend/access/heap/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/access/heap
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = heapam.o  heapam_visibility.o hio.o pruneheap.o rewriteheap.o \
+OBJS = heapam.o heapam_handler.o heapam_visibility.o hio.o p

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-04 Thread Tomas Vondra
On 3/4/19 6:55 PM, Laurenz Albe wrote:
> Masahiko Sawada wrote:
>> Why do people want to just encrypt everything? For satisfying some
>> security compliance?
> 
> I'd say that TDE primarily protects you from masked ninjas that
> break into your server room and rip out the disks with your database
> on them.
> 
> Or from people stealing your file system backups that you leave
> lying around in public.
> 
> My guess is that this requirement almost always comes from security
> departments that don't know a lot about the typical security threats
> that databases face, or (worse) from lawmakers.
> 
> And these are probably the people who will insist that *everything*
> is encrypted, even your commit log (unencrypted log? everyone can
> read the commits?).
> 

IMHO it's a sound design principle - deny access by default, then allow
for specific cases. It's much easier to reason about, and also to validate,
such solutions.

It's pretty much the same reason why firewall rules generally prohibit
everything by default, and then only allow access for specific ports,
from specific IP ranges, etc. Doing it the other way around is futile.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Segfault when restoring -Fd dump on current HEAD

2019-03-04 Thread Tom Lane
Andres Freund  writes:
> One thing I want to bring up here rather than in the pluggable storage
> thread is that currently the pg_dump support for access methods deals
> with table access methods in a manner similar to the way we deal with
> tablespaces. Instead of specifying the AM on every table creation, we
> set the default AM when needed.  That makes it easier to adjust dumps.

Hm.  I wonder if it'd make more sense to consider that an access method is
a property of a tablespace?  That is, all tables in a tablespace have the
same access method, so you don't need to label tables individually?

> But it does basically require breaking archive compatibility.  I
> personally am OK with that, but I thought it might be worth discussing.

I don't recall there being huge pushback when we did that in the past,
so I'm fine with it as long as there's an identifiable feature making
it necessary.

regards, tom lane



Re: GiST VACUUM

2019-03-04 Thread Andrey Borodin
Hi!

Thanks for fixing gist amcheck! The idea of using these two patches together 
seems so obvious now, but it never actually crossed my mind before.

> 4 марта 2019 г., в 18:04, Heikki Linnakangas  написал(а):
> Thanks! As I noted at 
> https://www.postgresql.org/message-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e%40iki.fi,
>  the test patch left the index corrupt. I fixed it so that it leaves behind 
> incompletely split pages, without the corruption, see attached. It's similar 
> to yours, but let me recap what it does:
> 
> * Hacks gistbuild(), create 100 empty pages immediately after the root pages. 
> They are leaked, so they won't be reused until the a VACUUM sees them and 
> puts them to the FSM
> 
> * Hacks gistinserttuples(), to leave the split incompleted with 50% 
> probability
> 
> * In vacuum, print a line to the log whenever it needs to "jump left"
> 
> I used this patch, with the attached test script that's similar to yours, but 
> it also tries to verify that the index returns correct results. It prints a 
> result set like this:
> 
>   sum
> -
> -364450
>  364450
> (2 rows)
> 
> If the two numbers add up to 0, the index seems valid. And you should see 
> "RESCAN" lines in the log, to show that jumping left happened. Because the 
> behavior is random and racy, you may need to run the script many times to see 
> both "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED BY FollowRight" cases. 
> Especially the "FollowRight" case happens less frequently than the NSN case, 
> you might need to run the script > 10 times to see it.
Great! I've repeated your tests on my machine and observe similar frequencies for 
the causes of rescan left jumps.

> I also tried your amcheck tool with this. It did not report any errors.
> 
> Attached is also latest version of the patch itself. It is the same as your 
> latest patch v19, except for some tiny comment kibitzing. I'll mark this as 
> Ready for Committer in the commitfest app, and will try to commit it in the 
> next couple of days.

That's cool! I'll work on the 2nd step of this patchset to make the blockset data 
structure prettier and less hacky.

Best regards, Andrey Borodin.


Re: Segfault when restoring -Fd dump on current HEAD

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 13:25:40 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > One thing I want to bring up here rather than in the pluggable storage
> > thread is that currently the pg_dump support for access methods deals
> > with table access methods in a manner similar to the way we deal with
> > tablespaces. Instead of specifying the AM on every table creation, we
> > set the default AM when needed.  That makes it easier to adjust dumps.
> 
> Hm.  I wonder if it'd make more sense to consider that an access method is
> a property of a tablespace?  That is, all tables in a tablespace have the
> same access method, so you don't need to label tables individually?

I don't think that'd work well. That'd basically necessitate creating
multiple tablespaces just to create a table with a different AM -
creating tablespaces is a superuser only activity that makes backups etc
more complicated. It also doesn't correspond well to pg_class.relam etc.


> > But it does basically require breaking archive compatibility.  I
> > personally am OK with that, but I thought it might be worth discussing.
> 
> I don't recall there being huge pushback when we did that in the past,
> so I'm fine with it as long as there's an identifiable feature making
> it necessary.

Cool.

Greetings,

Andres Freund



Re: query logging of prepared statements

2019-03-04 Thread Justin Pryzby
On Mon, Mar 04, 2019 at 06:53:31PM +0300, Arthur Zakirov wrote:
> Hello Justin,
> 
> On 27.02.2019 21:06, Justin Pryzby wrote:
> >I'm attaching a v2 patch which avoids repeated logging of PREPARE, rather 
> >than
> >making such logs conditional on log_error_verbosity=VERBOSE, since
> >log_error_verbosity is documented to control whether these are output:
> >DETAIL/HINT/QUERY/CONTEXT/SQLSTATE.
> I looked the patch. I got interesting result with different parameters.
> 
> But with log_statement='none' and log_min_duration_statement='0' I get:
> 
> =# execute test_ins(3);
> LOG:  duration: 8.439 ms  statement: execute test_ins(3);
> DETAIL:  prepare: prepare test_ins (int) as
> insert into test values ($1);
> 
> Is it intended? In the second result I got the query details.

It wasn't intentional.  Find attached a v3 patch which handles that case by
removing the 2nd call to errdetail_execute(); since that function is then
otherwise unused, it is removed entirely.

|postgres=# execute test_ins(3);
|2019-03-04 11:56:15.997 EST [4044] LOG:  duration: 0.637 ms  statement: 
execute test_ins(3);

I also fixed a 2nd behavior using library call pqExecPrepared with
log_min_duration_statement=0.

It was logging:
|LOG:  duration: 0.264 ms  statement: SELECT 1; PREPARE q AS SELECT $1
|LOG:  duration: 0.027 ms  bind q: SELECT 1; PREPARE q AS SELECT $1
|DETAIL:  parameters: $1 = '1'
|LOG:  duration: 0.006 ms  execute q: SELECT 1; PREPARE q AS SELECT $1
|DETAIL:  parameters: $1 = '1'

But now logs:

PGPORT=5679 PGHOST=/tmp PYTHONPATH=../PyGreSQL/build/lib.linux-x86_64-2.7/ 
python2.7 -c "import pg; d=pg.DB('postgres'); d.query('SET 
client_min_messages=error; SET log_error_verbosity=default; SET 
log_min_duration_statement=0; SET log_statement=\"none\"'); d.query('SELECT 1; 
PREPARE q AS SELECT \$1'); d.query_prepared('q',[1])"
|LOG:  duration: 0.479 ms  statement: SELECT 1; PREPARE q AS SELECT $1
|LOG:  duration: 0.045 ms  bind q
|DETAIL:  parameters: $1 = '1'
|LOG:  duration: 0.008 ms  execute q
|DETAIL:  parameters: $1 = '1'

Thanks for reviewing.  I'm also interested in discussing whether this
change is undesirable for someone else for some reason.  For me, the existing
output seems duplicative and "denormalized".  :)

Justin
>From c04f2fe815a55babe6a9bdd53421d74fc283094b Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 9 Feb 2019 19:20:43 -0500
Subject: [PATCH v3] Avoid repetitive log of PREPARE during EXECUTE of prepared
 statements

---
 src/backend/tcop/postgres.c | 54 +++--
 1 file changed, 8 insertions(+), 46 deletions(-)

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8b4d94c9a1..f348475ea3 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -182,7 +182,6 @@ static int	ReadCommand(StringInfo inBuf);
 static void forbidden_in_wal_sender(char firstchar);
 static List *pg_rewrite_query(Query *query);
 static bool check_log_statement(List *stmt_list);
-static int	errdetail_execute(List *raw_parsetree_list);
 static int	errdetail_params(ParamListInfo params);
 static int	errdetail_abort(void);
 static int	errdetail_recovery_conflict(void);
@@ -1041,8 +1040,7 @@ exec_simple_query(const char *query_string)
 	{
 		ereport(LOG,
 (errmsg("statement: %s", query_string),
- errhidestmt(true),
- errdetail_execute(parsetree_list)));
+ errhidestmt(true)));
 		was_logged = true;
 	}
 
@@ -1292,8 +1290,7 @@ exec_simple_query(const char *query_string)
 			ereport(LOG,
 	(errmsg("duration: %s ms  statement: %s",
 			msec_str, query_string),
-	 errhidestmt(true),
-	 errdetail_execute(parsetree_list)));
+	 errhidestmt(true)));
 			break;
 	}
 
@@ -1929,12 +1926,11 @@ exec_bind_message(StringInfo input_message)
 			break;
 		case 2:
 			ereport(LOG,
-	(errmsg("duration: %s ms  bind %s%s%s: %s",
+	(errmsg("duration: %s ms  bind %s%s%s",
 			msec_str,
 			*stmt_name ? stmt_name : "",
 			*portal_name ? "/" : "",
-			*portal_name ? portal_name : "",
-			psrc->query_string),
+			*portal_name ? portal_name : ""),
 	 errhidestmt(true),
 	 errdetail_params(params)));
 			break;
@@ -2062,14 +2058,13 @@ exec_execute_message(const char *portal_name, long max_rows)
 	if (check_log_statement(portal->stmts))
 	{
 		ereport(LOG,
-(errmsg("%s %s%s%s: %s",
+(errmsg("%s %s%s%s",
 		execute_is_fetch ?
 		_("execute fetch from") :
 		_("execute"),
 		prepStmtName,
 		*portal_name ? "/" : "",
-		*portal_name ? portal_name : "",
-		sourceText),
+		*portal_name ? portal_name : ""),
  errhidestmt(true),
  errdetail_params(portalParams)));
 		was_logged = true;
@@ -2150,15 +2145,14 @@ exec_execute_message(const char *portal_name, long max_rows)
 			break;
 		case 2:
 			ereport(LOG,
-	(errmsg("duration: %s ms  %s %s%s%s: %s",
+	(errmsg("duration: %s ms  %s %s%s%s",
 			msec_str,
 			execute_is_fetch ?
 			_("execute 

Re: \describe*

2019-03-04 Thread Corey Huinker
>
>
>> - Tab completion for \describe-verbose.
>> I know that \d+ tab completion is also not there, but I think we must
>> have tab completion for \describe-verbose.
>>
>> postgres=# \describe-
>> \describe-extension
>>  \describe-replication-publication \describe-user-mapping
>> \describe-foreign-data-wrapper
>> \describe-replication-subscription\describe-view
>> \describe-foreign-server  \describe-role
>>   \describe-window-function
>> \describe-foreign-table   \describe-rule
>>  ...
>>
>
I just confirmed that there isn't tab completion for the existing S/+
options, so it's hard to justify them for the equivalent verbose suffixes.



> (1 row)
>> Invalid command \describe. Try \? for help.
>>
>>
>> I think this status is causing the problem.
>>
>>
>>
>> +   /*
>> standard listing of interesting things */
>> +   success =
>> listTables("tvmsE", NULL, show_verbose, show_system);
>> +   }
>> +   status = PSQL_CMD_UNKNOWN;
>>
>>
I'll look into this, thanks!



> - Confusion about \desc and \desC
>> There is confusion while running the \desc command. I know the problem,
>> but the user may confuse by this.
>> postgres=# \desC
>>List of foreign servers
>>  Name | Owner | Foreign-data wrapper
>> --+---+--
>> (0 rows)
>>
>> postgres=# \desc
>> Invalid command \desc. Try \? for help.
>>
>> - Auto-completion of commands.
>> There is some more confusion in the completion of commands.
>>
>> This command shows List of aggregates.
>> postgres=# \describe-aggregate-function
>>  List of aggregate functions
>>  Schema | Name | Result data type | Argument data types | Description
>> +--+--+-+-
>> (0 rows)
>>
>>
>>
>> This command shows a list of relation "\d"
>> postgres=# \describe-aggregatE-function
>> List of relations
>>  Schema | Name | Type  |  Owner
>> +--+---+-
>>  public | foo  | table | vagrant
>> (1 row)
>>
>> This command also shows a list of relations "\d".
>> postgres=# \describe-aggr
>> List of relations
>>  Schema | Name | Type  |  Owner
>> +--+---+-
>>  public | foo  | table | vagrant
>> (1 row)
>>
>> This command shows error messages.
>> postgres=# \descr
>> Invalid command \descr. Try \? for help.
>>
>>
I will look into it.



>
>> I have done a brief code review except for the documentation code. I
>> don't like this code
>>
>> if (cmd_match(cmd,"describe-aggregate-function"))
>>
>>  success = describeAggregates(pattern, show_verbose, show_system);
>>  else if (cmd_match(cmd,
>> "describe-access-method"))
>>  success = describeAccessMethods(pattern,
>> show_verbose);
>>  else if (cmd_match(cmd,
>> "describe-tablespace"))
>>  success = describeTablespaces(pattern,
>> show_verbose);
>>  else if (cmd_match(cmd,
>> "describe-conversion"))
>>  success = listConversions(pattern,
>> show_verbose, show_system);
>>  else if (cmd_match(cmd, "describe-cast"))
>>  success = listCasts(pattern, show_verbose
>>
>>
>> This can be achieved with the list/array/hash table, so I have changed
>> that code in the attached patch just for a sample if you want I can do that
>> for whole code.
>>
>
There are some problems with a hash table. The function signatures vary quite
a lot, and some require an additional psql_scan_slash_option call.
The hash option, if implemented, should probably be expanded to all slash
commands, at which point maybe it belongs in psqlscanslash.l...

>


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-03-04 Thread Tomas Vondra



On 3/4/19 6:40 AM, Masahiko Sawada wrote:
> On Sat, Mar 2, 2019 at 5:27 AM Robert Haas  wrote:
>>
>> On Thu, Feb 7, 2019 at 3:28 AM Masahiko Sawada  wrote:
>>> WAL encryption will follow as an additional feature.
>>
>> I don't think WAL encryption is an optional feature.  You can argue
>> about whether it's useful to encrypt the disk files in the first place
>> given that there's no privilege boundary between the OS user and the
>> database, but a lot of people seem to think it is and maybe they're
>> right.  However, who can justify encrypting only SOME of the disk
>> files and not others?  I mean, maybe you could justify not encryption
>> the SLRU files on the grounds that they probably don't leak much in
>> the way of interesting information, but the WAL files certainly do --
>> your data is there, just as much as in the data files themselves.
>>
> 
> Agreed.
> 
>> To be honest, I think there is a lot to like about the patches
>> Cybertec has proposed.  Those patches don't have all of the fancy
>> key-management stuff that you are proposing here, but maybe that
>> stuff, if we want it, could be added, rather than starting over from
>> scratch.  It seems to me that those patches get a lot of things right.
>> In particular, it looked to me when I looked at them like they made a
>> pretty determined effort to encrypt every byte that might go down to
>> the disk.  It seems to me that you if you want encryption, you want
>> that.
>>
> 
> Agreed. I think the patch lacks the key management stuff: 2-tier key
> architecture and integration of postgres with key management systems.
> I'd like to work together and can propose the patch of key management
> stuff to the proposed patch.
> 

Sounds like a plan. It'd be nice to come up with a unified version of
those two patches, combining the good pieces from both.

I wonder how other databases deal with key management? Surely we're not
the first/only database that tries to do transparent encryption, so
perhaps we could learn something from others? For example, do they use
this 2-tier key architecture? How do they do key management? etc.

I don't say we should copy from them, but it'd allow us to (a) avoid
making the same mistakes and (b) build a solution the users are already
somewhat familiar with.

May I suggest creating a page on the PostgreSQL wiki, explaining the
design and updating it as the discussion develops? It's rather difficult
to follow all the different sub-threads, and IIRC some larger patches
used that successfully for this purpose.

See for example:

* https://wiki.postgresql.org/wiki/Parallel_External_Sort
* https://wiki.postgresql.org/wiki/Parallel_Internal_Sort


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pg_partition_tree crashes for a non-defined relation

2019-03-04 Thread Alvaro Herrera
On 2019-Feb-28, Alvaro Herrera wrote:

> What about legacy inheritance?  I see that pg_partition_tree handles
> that case perfectly well -- it seems to return the complete hierarchy
> rooted at the given relation.  However, it seems odd that it works at
> all, don't you think?  Consider this:
> [...]
> I would opt for returning the empty set for legacy inheritance too.

I added tests to immortalize the current behavior for legacy inheritance
relations too.  Thanks!

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: converting Lists into arrays

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-02 18:11:43 -0500, Tom Lane wrote:
> On test cases like "pg_bench -S" it seems to be pretty much within the
> noise level of being the same speed as HEAD.

I think that might be because it's bottleneck is just elsewhere
(e.g. very context switch heavy, very few lists of any length).

FWIW, even just taking context switches out of the equation leads to
a ~5-6 %benefit in a simple statement:

DO $f$BEGIN FOR i IN 1..50 LOOP EXECUTE $s$SELECT aid, bid, abalance, 
filler FROM pgbench_accounts WHERE aid = 2045530;$s$;END LOOP;END;$f$;

master:
+6.05%  postgres  postgres[.] AllocSetAlloc
+5.52%  postgres  postgres[.] base_yyparse
+2.51%  postgres  postgres[.] palloc
+1.82%  postgres  postgres[.] hash_search_with_hash_value
+1.61%  postgres  postgres[.] core_yylex
+1.57%  postgres  postgres[.] SearchCatCache1
+1.43%  postgres  postgres[.] expression_tree_walker.part.4
+1.09%  postgres  postgres[.] check_stack_depth
+1.08%  postgres  postgres[.] MemoryContextAllocZeroAligned

patch v3:
+5.77%  postgres  postgres[.] base_yyparse
+4.88%  postgres  postgres[.] AllocSetAlloc
+1.95%  postgres  postgres[.] hash_search_with_hash_value
+1.89%  postgres  postgres[.] core_yylex
+1.64%  postgres  postgres[.] SearchCatCache1
+1.46%  postgres  postgres[.] expression_tree_walker.part.0
+1.45%  postgres  postgres[.] palloc
+1.18%  postgres  postgres[.] check_stack_depth
+1.13%  postgres  postgres[.] MemoryContextAllocZeroAligned
+1.04%  postgres  libc-2.28.so[.] _int_malloc
+1.01%  postgres  postgres[.] nocachegetattr

And even just pgbenching the EXECUTEd statement above gives me a
reproducible ~3.5% gain when using -M simple, and ~3% when using -M
prepared.

Note than when not using prepared statement (a pretty important
workload, especially as long as we don't have a pooling solution that
actually allows using prepared statement across connections), even after
the patch most of the allocator overhead is still from list allocations,
but it's near exclusively just the "create a new list" case:

+5.77%  postgres  postgres[.] base_yyparse
-4.88%  postgres  postgres[.] AllocSetAlloc
   - 80.67% AllocSetAlloc
  - 68.85% AllocSetAlloc
 - 57.65% palloc
- 50.30% new_list (inlined)
   - 37.34% lappend
  + 12.66% pull_var_clause_walker
  + 8.83% build_index_tlist (inlined)
  + 8.80% make_pathtarget_from_tlist
  + 8.73% get_quals_from_indexclauses (inlined)
  + 8.73% distribute_restrictinfo_to_rels
  + 8.68% RewriteQuery
  + 8.56% transformTargetList
  + 8.46% make_rel_from_joinlist
  + 4.36% pg_plan_queries
  + 4.30% add_rte_to_flat_rtable (inlined)
  + 4.29% build_index_paths
  + 4.23% match_clause_to_index (inlined)
  + 4.22% expression_tree_mutator
  + 4.14% transformFromClause
  + 1.02% get_index_paths
   + 17.35% list_make1_impl
   + 16.56% list_make1_impl (inlined)
   + 15.87% lcons
   + 11.31% list_copy (inlined)
   + 1.58% lappend_oid
+ 12.90% expression_tree_mutator
+ 9.73% get_relation_info
+ 4.71% bms_copy (inlined)
+ 2.44% downcase_identifier
+ 2.43% heap_tuple_untoast_attr
+ 2.37% add_rte_to_flat_rtable (inlined)
+ 1.69% btbeginscan
+ 1.65% CreateTemplateTupleDesc
+ 1.61% core_yyalloc (inlined)
+ 1.59% heap_copytuple
+ 1.54% text_to_cstring (inlined)
+ 0.84% ExprEvalPushStep (inlined)
+ 0.84% ExecInitRangeTable
+ 0.84% scanner_init
+ 0.83% ExecInitRangeTable
+ 0.81% CreateQueryDesc
+ 0.81% _bt_search
+ 0.77% ExecIndexBuildScanKeys
+ 0.66% RelationGetIndexScan
+ 0.65% make_pathtarget_from_tlist


Given how hard it is to improve performance with costs as flatly
distributed as in the above profiles, I actually think these are quite promising
results.

I'm not even convinced that it makes all that much sense to measure
end-to-end performance here; it might be worthwhile to measure with a
debugging function that allows exercising parsing, parse analysis,
rewriting etc. at configurable loop counts. Given the relatively evenly
distributed profiles, we're going to have to make a few different
improvements to make headway, and it's hard to see the benefits of
individual ones if you only look at the overall numbers.

Greetings,

Andres Freund

Re: POC: converting Lists into arrays

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-03 13:29:04 -0500, Tom Lane wrote:
> The cases I've been looking at suggest to me that we'd make far
> more progress on the excessive-palloc'ing front if we could redesign
> things to reduce unnecessary copying of parsetrees.  Right now the
> planner does an awful lot of copying because of fear of unwanted
> modifications of multiply-linked subtrees.  I suspect that we could
> reduce that overhead with some consistently enforced rules about
> not scribbling on input data structures; but it'd take a lot of work
> to get there, and I'm afraid it'd be easily broken :-(

Given the difficulty of this task, isn't your patch actually a *good*
attack on the problem? It makes copying lists considerably cheaper. As
you say, a more principled answer to this problem is hard, so attacking
it from the "make the constant factors smaller" side doesn't seem crazy?

Greetings,

Andres Freund



Re: POC: converting Lists into arrays

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 13:11:35 -0500, Tom Lane wrote:
> The concern I have is mostly about the use of lists as core infrastructure
> in parsetree, plantree, etc data structures.  I think any idea that we'd
> replace those piecemeal is borderline insane: it's simply not worth it
> from a notational and bug-risk standpoint to glue together some parts of
> those structures differently from the way other parts are glued together.

I don't buy this. I think e.g. redesigning the way we represent
targetlists would be good (it's e.g. insane that we recompute
descriptors out of them all the time), and would reduce their allocator
costs.

Greetings,

Andres Freund



Re: Allowing extensions to supply operator-/function-specific info

2019-03-04 Thread Paul Ramsey
So I am getting much closer to a working implementation in PostGIS,
but have just run into an issue which I am assuming is my
misunderstanding of something...

https://github.com/pramsey/postgis/blob/92268c94f3aa1fc63a2941f2b451be15b28662cf/postgis/gserialized_supportfn.c#L287

I had what seemed to be working code except for a couple rare cases,
but when I fixed those cases it turned out that I had a major problem:
building an <indexed column> OP <constant> expression works fine, but building a
<constant> OP <indexed column> expression returns me an error.

create table f as select st_makepoint(200*random() - 100, 200*random()
- 100) as g from generate_series(0, 10);
create index f_g_x on f using gist (g);
explain select * from f where st_coveredby('POINT(5 0)', g);
explain select * from f where st_coveredby(g, 'POINT(5 0)');

QUERY PLAN
---
 Bitmap Heap Scan on f  (cost=13.36..314.58 rows=4 width=32)
   Filter: st_coveredby(g,
'0101001440'::geometry)
   ->  Bitmap Index Scan on f_g_x  (cost=0.00..5.03 rows=100 width=0)
 Index Cond: (g @
'0101001440'::geometry)

postgis=# explain select * from f where st_coveredby('POINT(5 0)', g);

ERROR:  index key does not match expected index column

Any thoughts?

P



Re: Checksum errors in pg_stat_database

2019-03-04 Thread Julien Rouhaud
On Fri, Feb 22, 2019 at 3:01 PM Magnus Hagander  wrote:
>
> It tracks things that happen in the general backends. Possibly we should also 
> consider counting the errors actually found when running base backups? OTOH, 
> that part of the code doesn't really track things like databases (as it 
> operates just on the raw data directory underneath), so that implementation 
> would definitely not be as clean...

Sorry I just realized that I totally forgot this part of the thread.

While it's true that we operate on the raw directory, I see that sendDir()
already sets up an isDbDir variable, and if it is true lastDir should
contain the OID of the underlying database.  Wouldn't it be enough to
call sendFile() using this, something like (untested):

if (!sizeonly)
- sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true);
+ sent = sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf, true,
isDbDir ? pg_atoi(lastDir+1, 4) : InvalidOid);

and accordingly report any checksum error from sendFile()?



Re: psql display of foreign keys

2019-03-04 Thread Alvaro Herrera
On 2019-Feb-28, Michael Paquier wrote:

> On Wed, Feb 27, 2019 at 03:37:23PM -0300, Alvaro Herrera wrote:

> > +pg_partition_ancestors(PG_FUNCTION_ARGS)
> > +{
> > +   Oid relid = PG_GETARG_OID(0);
> > +   FuncCallContext *funcctx;
> > +   ListCell  **next;
> > +
> > +   if (!check_rel_can_be_partition(relid))
> > +   PG_RETURN_NULL();
> 
> Not returning an empty set here? ;)

Yeah, I adapted to what was there then, but in the original coding I had
the SRF_RETURN_DONE that you committed for pg_partition_tree.
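
For anyone following along, here is a minimal sketch of the "empty set
rather than NULL" pattern being referred to; the demo function below is
made up for illustration, it is not code from either patch:

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(demo_empty_set);

/* Returns the integers 1..n, or an empty set -- zero rows, not one
 * NULL row -- when n is negative. */
Datum
demo_empty_set(PG_FUNCTION_ARGS)
{
	int32		n = PG_GETARG_INT32(0);
	FuncCallContext *funcctx;

	if (SRF_IS_FIRSTCALL())
	{
		funcctx = SRF_FIRSTCALL_INIT();

		/* the "return an empty set" exit: finish before producing any rows */
		if (n < 0)
			SRF_RETURN_DONE(funcctx);

		funcctx->max_calls = n;
	}

	funcctx = SRF_PERCALL_SETUP();

	if (funcctx->call_cntr < funcctx->max_calls)
	{
		/* compute the result before SRF_RETURN_NEXT bumps call_cntr */
		Datum		result = Int32GetDatum((int32) funcctx->call_cntr + 1);

		SRF_RETURN_NEXT(funcctx, result);
	}

	SRF_RETURN_DONE(funcctx);
}

Replacing that early SRF_RETURN_DONE() with a plain PG_RETURN_NULL() is
what produces a single NULL row instead of zero rows.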

> I would have added tests with pg_partition_ancestors(NULL) and
> pg_partition_ancestors(0) for consistency with the rest.

Done.

> Except that and the ancestor tracking for inheritance, the shape of
> the patch looks good to me.

Thanks for reviewing!  I have pushed with your proposed changes.

Here's the patch I'm really interested in :-)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 952d89b43543bb1c597a4fb8d4a8846a70daa0af Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 26 Feb 2019 18:57:25 -0300
Subject: [PATCH v5] fix psql display of FKs

---
 src/bin/psql/describe.c   | 134 --
 src/test/regress/expected/foreign_key.out |  26 ++---
 2 files changed, 114 insertions(+), 46 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 4da6719ce71..6dac5fa3009 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -1479,6 +1479,7 @@ describeOneTableDetails(const char *schemaname,
 		bool		rowsecurity;
 		bool		forcerowsecurity;
 		bool		hasoids;
+		bool		ispartition;
 		Oid			tablespace;
 		char	   *reloptions;
 		char	   *reloftype;
@@ -1501,7 +1502,24 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, "
-		  "false AS relhasoids, %s, c.reltablespace, "
+		  "false AS relhasoids, c.relispartition, %s, c.reltablespace, "
+		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
+		  "c.relpersistence, c.relreplident\n"
+		  "FROM pg_catalog.pg_class c\n "
+		  "LEFT JOIN pg_catalog.pg_class tc ON (c.reltoastrelid = tc.oid)\n"
+		  "WHERE c.oid = '%s';",
+		  (verbose ?
+		   "pg_catalog.array_to_string(c.reloptions || "
+		   "array(select 'toast.' || x from pg_catalog.unnest(tc.reloptions) x), ', ')\n"
+		   : "''"),
+		  oid);
+	}
+	else if (pset.sversion >= 11)
+	{
+		printfPQExpBuffer(&buf,
+		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
+		  "c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, "
+		  "c.relhasoids, c.relispartition, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence, c.relreplident\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1518,7 +1536,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, c.relrowsecurity, c.relforcerowsecurity, "
-		  "c.relhasoids, %s, c.reltablespace, "
+		  "c.relhasoids, false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence, c.relreplident\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1535,7 +1553,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, false, false, c.relhasoids, "
-		  "%s, c.reltablespace, "
+		  "false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence, c.relreplident\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1552,7 +1570,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, false, false, c.relhasoids, "
-		  "%s, c.reltablespace, "
+		  "false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtype::pg_catalog.text END, "
 		  "c.relpersistence\n"
 		  "FROM pg_catalog.pg_class c\n "
@@ -1569,7 +1587,7 @@ describeOneTableDetails(const char *schemaname,
 		printfPQExpBuffer(&buf,
 		  "SELECT c.relchecks, c.relkind, c.relhasindex, c.relhasrules, "
 		  "c.relhastriggers, false, false, c.relhasoids, "
-		  "%s, c.reltablespace, "
+		  "false, %s, c.reltablespace, "
 		  "CASE WHEN c.reloftype = 0 THEN '' ELSE c.reloftype::pg_catalog.regtyp

Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Bossart, Nathan
On 3/3/19, 9:23 PM, "Masahiko Sawada"  wrote:
> FWIW, I agree that we have options for vacuum as vacuum
> command options. But for reloptions, I think we should not add them if
> the persistence of the setting could be problematic. According to the
> discussions so far, I think VACUUM_SHRINK_ENABLED is the one option
> that can be available as both a vacuum command option and a reloption.
> But I'm not sure there is a good use case even if we can set
> DISABLE_INDEX_CLEANUP as a reloption.

+1

The DISABLE_INDEX_CLEANUP option is intended to help avoid transaction
ID wraparound and should not be used as a long-term VACUUM strategy
for a table.

Nathan



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 20:03:37 +, Bossart, Nathan wrote:
> On 3/3/19, 9:23 PM, "Masahiko Sawada"  wrote:
> > FWIW, I agree that we have options for vacuum as vacuum
> > command options. But for reloptions, I think we should not add them if
> > the persistence of the setting could be problematic. According to the
> > discussions so far, I think VACUUM_SHRINK_ENABLED is the one option
> > that can be available as both a vacuum command option and a reloption.
> > But I'm not sure there is a good use case even if we can set
> > DISABLE_INDEX_CLEANUP as a reloption.
> 
> +1
> 
> The DISABLE_INDEX_CLEANUP option is intended to help avoid transaction
> ID wraparound and should not be used as a long-term VACUUM strategy
> for a table.

I'm not quite convinced this is right.  There are plenty of sites that
practically can't use autovacuum because it might decide to vacuum the
5TB index because of 300 dead tuples in the middle of busy periods.  And
without a reloption that's not controllable.

- Andres



Re: Rare SSL failures on eelpout

2019-03-04 Thread Tom Lane
I wrote:
> Thomas Munro  writes:
>> That suggests that we could perhaps handle ECONNRESET both at startup
>> packet send time (for certificate rejection, eelpout's case) and at
>> initial query send (for idle timeout, bug #15598's case) by attempting
>> to read.  Does that make sense?

> Hmm ... it definitely makes sense that we shouldn't assume that a *write*
> failure means there is nothing left to *read*.

After staring at the code for awhile, I am thinking that there may be
a bug of that ilk, but if so it's down inside OpenSSL.  Perhaps it's
specific to the OpenSSL version you're using on eelpout?  There is
not anything we could do differently in libpq, AFAICS, because it's
OpenSSL's responsibility to read any data that might be available.

I also looked into the idea that we're doing something wrong on the
server side, allowing the final error message to not get flushed out.
A plausible theory there is that SSL_shutdown is returning a WANT_READ
or WANT_WRITE error and we should retry it ... but that doesn't square
with your observation upthread that it's returning SSL_ERROR_SSL.

It's all very confusing, but I think there's a nontrivial chance
that this is an OpenSSL bug, especially since we haven't been able
to replicate it elsewhere.

regards, tom lane



Re: Allowing extensions to supply operator-/function-specific info

2019-03-04 Thread Tom Lane
Paul Ramsey  writes:
> I had what seemed to be working code except for a couple rare cases,
> but when I fixed those cases it turned out that I had a major problem:
> building an <indexed column> OP <constant> expression works fine, but building a
> <constant> OP <indexed column> expression returns me an error.

Yup, you're not supposed to do that.  The output expression *must* have
the index key on the left; it's up to you to commute the operator if
needed to make that happen.
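
For example, a rough sketch of the commute step (the helper name and
argument handling are invented here for illustration; this is not taken
from the PostGIS patch):

#include "postgres.h"
#include "catalog/pg_type.h"
#include "nodes/makefuncs.h"
#include "utils/lsyscache.h"

/*
 * Build "indexkey OP other" for a support-function index condition,
 * commuting the operator when the original call had the index key as
 * its second argument.  Returns NULL if the operator has no commutator.
 */
static Expr *
make_index_condition(Oid opno, Expr *indexkey, Expr *other,
					 bool indexkey_was_leftarg, Oid inputcollid)
{
	if (!indexkey_was_leftarg)
	{
		/* e.g. st_coveredby('POINT(5 0)', g): swap args, use commutator */
		opno = get_commutator(opno);
		if (!OidIsValid(opno))
			return NULL;
	}

	/* the index key always ends up as the left-hand operand */
	return make_opclause(opno, BOOLOID, false,
						 indexkey, other,
						 InvalidOid, inputcollid);
}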

regards, tom lane



Re: POC: converting Lists into arrays

2019-03-04 Thread Tom Lane
Andres Freund  writes:
> I don't buy this. I think e.g. redesigning the way we represent
> targetlists would be good (it's e.g. insane that we recompute
> descriptors out of them all the time), and would reduce their allocator
> costs.

Maybe we're not on the same page here, but it seems to me that that'd be
addressable with pretty localized changes (eg, adding more fields to
TargetEntry, or keeping a pre-instantiated output tupdesc in each Plan
node).  But if the concern is about the amount of palloc bandwidth going
into List cells, we're not going to be able to improve that with localized
data structure changes; it'll take something like the patch I've proposed.

I *have* actually done some tests of the sort you proposed, driving
just the planner and not any of the rest of the system, but I still
didn't find much evidence of big wins.  I find it interesting that
you get different results.

regards, tom lane



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Bossart, Nathan
On 3/4/19, 12:11 PM, "Andres Freund"  wrote:
> I'm not quite convinced this is right.  There are plenty of sites that
> practically can't use autovacuum because it might decide to vacuum the
> 5TB index because of 300 dead tuples in the middle of busy periods.  And
> without a reloption that's not controllable.

Wouldn't it be better to adjust the cost and threshold parameters or
to manually vacuum during quieter periods?  I suppose setting
DISABLE_INDEX_CLEANUP on a relation during busy periods could be
useful if you really need to continue reclaiming transaction IDs, but
that seems like an easy way to accidentally never vacuum indexes.

Nathan



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 21:40:53 +, Bossart, Nathan wrote:
> On 3/4/19, 12:11 PM, "Andres Freund"  wrote:
> > I'm not quite convinced this is right.  There are plenty of sites that
> > practically can't use autovacuum because it might decide to vacuum the
> > 5TB index because of 300 dead tuples in the middle of busy periods.  And
> > without a reloption that's not controllable.
>
> Wouldn't it be better to adjust the cost and threshold parameters or
> to manually vacuum during quieter periods?

No. (auto)vacuum is useful to reclaim space etc. It's just the
unnecessary index cleanup that's the problem...  Most of the space can
be reclaimed after all, the item pointer ain't that big...


> I suppose setting DISABLE_INDEX_CLEANUP on a relation during busy
> periods could be useful if you really need to continue reclaiming
> transaction IDs, but that seems like an easy way to accidentally never
> vacuum indexes.

Yea, I do think that's a danger. But we allow disabling autovacuum, so
I'm not sure it matters that much... And for indexes you'd still have
the index page-level vacuum that'd continue to work.

- Andres



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2019-03-04 Thread Peter Geoghegan
On Sun, Mar 3, 2019 at 10:02 PM Heikki Linnakangas  wrote:
> Some comments on
> v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch below.
> Mostly about code comments. In general, I think a round of copy-editing
> the comments, to use simpler language, would do good. The actual code
> changes look good to me.

I'm delighted that the code looks good to you, and makes sense
overall. I worked hard to make the patch a natural adjunct to the
existing code, which wasn't easy.

> Seems confusing to first say assertively that "*bufptr contains the page
> that the new tuple unambiguously belongs to", and then immediately go on
> to list a whole bunch of exceptions. Maybe just remove "unambiguously".

This is fixed in v14 of the patch series.

> This happens very seldom, because you only get an incomplete split if
> you crash in the middle of a page split, and that should be very rare. I
> don't think we need to list more fine-grained conditions here, that just
> confuses the reader.

Fixed in v14.

> > /*
> >  *_bt_useduplicatepage() -- Settle for this page of duplicates?

> So, this function is only used for legacy pg_upgraded indexes. The
> comment implies that, but doesn't actually say it.

I made that more explicit in v14.

> > /*
> >  * Get tie-breaker heap TID attribute, if any.  Macro works with both pivot
> >  * and non-pivot tuples, despite differences in how heap TID is represented.

> > #define BTreeTupleGetHeapTID(itup) ...

I fixed up the comments above BTreeTupleGetHeapTID() significantly.

> The comment claims that "all pivot tuples must be as of BTREE_VERSION
> 4". I thought that all internal tuples are called pivot tuples, even on
> version 3.

In my mind, "pivot tuple" is a term that describes any tuple that
contains a separator key, which could apply to any nbtree version.
It's useful to have a distinct term (to not just say "separator key
tuple") because Lehman and Yao think of separator keys as separate and
distinct from downlinks. Internal page splits actually split *between*
a separator key and a downlink. So nbtree internal page splits must
split "inside a pivot tuple", leaving its separator on the left hand
side (new high key), and its downlink on the right hand side (new
minus infinity tuple).

Pivot tuples may contain a separator key and a downlink, just a
downlink, or just a separator key (sometimes this is implicit, and the
block number is garbage). I am particular about the terminology
because the pivot tuple vs. downlink vs. separator key thing causes a
lot of confusion, particularly when you're using Lehman and Yao (or
Lanin and Shasha) to understand how things work in Postgres.

We want to have a broad term that refers to the tuples that describe
the keyspace (pivot tuples), since it's often helpful to refer to them
collectively, without seeming to contradict Lehman and Yao.

> I think what this means to say is that this macro is only
> used on BTREE_VERSION 4 indexes. Or perhaps that pivot tuples can only
> have a heap TID in BTREE_VERSION 4 indexes.

My high level approach to pg_upgrade/versioning is for index scans to
*pretend* that every nbtree index (even on v2 and v3) has a heap
attribute that actually makes the keys unique. The difference is that
v4 gets to use a scantid, and actually rely on the sort order of heap
TIDs, whereas pg_upgrade'd indexes "are not allowed to look at the
heap attribute", and must never provide a scantid (they also cannot
use the !minusinfkey optimization, but this is only an optimization
that v4 indexes don't truly need). They always do the right thing
(move left) on otherwise-equal pivot tuples, since they have no
scantid.

That's why _bt_compare() can use BTreeTupleGetHeapTID() without caring
about the version the index uses. It might be NULL for a pivot tuple
in a v3 index, even though we imagine/pretend that it should have a
value set. But that doesn't matter, because higher level code knows
that !heapkeyspace indexes should never get a scantid (_bt_compare()
does Assert() that they got that detail right, though). We "have no
reason to peak", because we don't have a scantid, so index scans work
essentially the same way, regardless of the version in use.

There are a few specific cross-version things that we need think about
outside of making sure that there is never a scantid (and !minusinfkey
optimization is unused) in < v4 indexes, but these are all related to
unique indexes. "Pretending that all indexes have a heap TID" is a
very useful mental model. Nothing really changes, even though you
might guess that changing the classic "Subtree S is described by Ki <
v <= Ki+1" invariant would need to break code in
_bt_binsrch()/_bt_compare(). Just pretend that the classic invariant
was there since the Berkeley days, and don't do anything that breaks
the useful illusion on versions before v4.

> This macro (and many others in nbtree.h) is quite complicated. A static
> inline function might be easier to read.

I agree that the macros a

Re: Allowing extensions to supply operator-/function-specific info

2019-03-04 Thread Paul Ramsey



> On Mar 4, 2019, at 1:13 PM, Tom Lane  wrote:
> 
> Paul Ramsey  writes:
>> I had what seemed to be working code except for a couple rare cases,
>> but when I fixed those cases it turned out that I had a major problem:
>> building an <indexed column> OP <constant> expression works fine, but building a
>> <constant> OP <indexed column> expression returns me an error.
> 
> Yup, you're not supposed to do that.  The output expression *must* have
> the index key on the left, it's up to you to commute the operator if
> needed to make that happen.

Gotcha, done and now have an implementation that passes all our regression 
tests.

Thanks!

P


[Proposal] TOAST'ing in slices

2019-03-04 Thread Bruno Hass
Hello Hackers,

I'm intending to optimize some varlena data types as my GSoC proposal. That 
would be done by a smarter way of splitting the TOAST table chunks, depending 
on the data type. A JSONB would be split considering its internal tree
structure, keeping track of which keys are in each chunk. Arrays and text 
fields are candidates for optimization as well, as I outlined in my proposal 
draft (attached). Not being very familiar with the source code, I am writing to 
this list seeking some guidance on where to look in the source code in order to
detail my implementation. Furthermore, I have a couple of questions:

  *   Would it be a good idea to modify the TOAST table row to keep metadata on 
the data it stores?
  *   This proposal will modify how JSONB, array and text fields are TOASTed. 
Where can I find the code related to that?

Kind regards,

Bruno Hass



PostgreSQL Proposal.pdf
Description: PostgreSQL Proposal.pdf


Re: Rare SSL failures on eelpout

2019-03-04 Thread Thomas Munro
On Tue, Mar 5, 2019 at 10:08 AM Tom Lane  wrote:
> I wrote:
> > Thomas Munro  writes:
> >> That suggests that we could perhaps handle ECONNRESET both at startup
> >> packet send time (for certificate rejection, eelpout's case) and at
> >> initial query send (for idle timeout, bug #15598's case) by attempting
> >> to read.  Does that make sense?
>
> > Hmm ... it definitely makes sense that we shouldn't assume that a *write*
> > failure means there is nothing left to *read*.
>
> After staring at the code for awhile, I am thinking that there may be
> a bug of that ilk, but if so it's down inside OpenSSL.  Perhaps it's
> specific to the OpenSSL version you're using on eelpout?  There is
> not anything we could do differently in libpq, AFAICS, because it's
> OpenSSL's responsibility to read any data that might be available.
>
> I also looked into the idea that we're doing something wrong on the
> server side, allowing the final error message to not get flushed out.
> A plausible theory there is that SSL_shutdown is returning a WANT_READ
> or WANT_WRITE error and we should retry it ... but that doesn't square
> with your observation upthread that it's returning SSL_ERROR_SSL.
>
> It's all very confusing, but I think there's a nontrivial chance
> that this is an OpenSSL bug, especially since we haven't been able
> to replicate it elsewhere.

Hmm.  Yes, it is strange that we haven't seen it elsewhere, but
remember that very few animals are running the ssl tests; also it
requires particular timing to hit.

OK, here's something.  I can reproduce it quite easily on this
machine, and I can fix it like this:

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index f29202db5f..e9c137f1bd 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2705,6 +2705,7 @@ keep_going:		/* We will come back to here until there is
 							  libpq_gettext("could not send startup packet: %s\n"),
 							  SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
 				free(startpacket);
+				pqHandleSendFailure(conn);
 				goto error_return;
 			}

If I add some printf debugging in there, I can see that block being
reached every hundred or so times I try to connect with a revoked
certificate, and with that extra call to pqHandleSendFailure() in
there the error comes out as it should:

psql: SSL error: sslv3 alert certificate revoked

Now I'm confused because we already have handling like that in
PQsendQuery(), so I can't explain bug #15598.

-- 
Thomas Munro
https://enterprisedb.com



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Bossart, Nathan
On 3/4/19, 1:44 PM, "Andres Freund"  wrote:
> On 2019-03-04 21:40:53 +, Bossart, Nathan wrote:
>> On 3/4/19, 12:11 PM, "Andres Freund"  wrote:
>> > I'm not quite convinced this is right.  There are plenty of sites that
>> > practically can't use autovacuum because it might decide to vacuum the
>> > 5TB index because of 300 dead tuples in the middle of busy periods.  And
>> > without a reloption that's not controllable.
>>
>> Wouldn't it be better to adjust the cost and threshold parameters or
>> to manually vacuum during quieter periods?
>
> No. (auto)vacuum is useful to reclaim space etc. It's just the
> unnecessary index cleanup that's the problem...  Most of the space can
> be reclaimed after all, the item pointer ain't that big...

I see what you mean.

>> I suppose setting DISABLE_INDEX_CLEANUP on a relation during busy
>> periods could be useful if you really need to continue reclaiming
>> transaction IDs, but that seems like an easy way to accidentally never
>> vacuum indexes.
>
> Yea, I do think that's a danger. But we allow disabling autovacuum, so
> I'm not sure it matters that much... And for indexes you'd still have
> the index page-level vacuum that'd continue to work.

I think the difference here is that there isn't something like
autovacuum_freeze_max_age to force index cleanup at some point.
Granted, you can set autovacuum_freeze_max_age to 2B if you want, but
at least there's a fallback available.

Nathan



Re: POC: converting Lists into arrays

2019-03-04 Thread Bruce Momjian
On Mon, Mar  4, 2019 at 12:44:41PM -0500, Robert Haas wrote:
> I think that's a separate but also promising thing to attack, and I
> agree that it'd take a lot of work to get there.  I don't think that
> the problem with either parse-tree-copying or list usage is that no
> performance benefits are to be had; I think it's that the amount of
> work required to get those benefits is pretty large.  If it were
> otherwise, somebody likely would have done it before now.

Stupid question, but do we use any kind of reference counter to know if
two subsystems look at a structure, and a copy is required?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 22:00:47 +, Bossart, Nathan wrote:
> On 3/4/19, 1:44 PM, "Andres Freund"  wrote:
> > Yea, I do think that's a danger. But we allow disabling autovacuum, so
> > I'm not sure it matters that much... And for indexes you'd still have
> > the index page-level vacuum that'd continue to work.
> 
> I think the difference here is that there isn't something like
> autovacuum_freeze_max_age to force index cleanup at some point.
> Granted, you can set autovacuum_freeze_max_age to 2B if you want, but
> at least there's a fallback available.

Well, but your cluster doesn't suddenly shut down because of index bloat
(in contrast to xid wraparound). So I don't quite see an equivalent need
for an emergency valve.  I think we should just put a warning into the
reloption's docs, and leave it at that.

Greetings,

Andres Freund



Re: POC: converting Lists into arrays

2019-03-04 Thread Bruce Momjian
On Mon, Mar  4, 2019 at 01:11:35PM -0500, Tom Lane wrote:
> Robert Haas  writes:
> > I think the reason why you're not seeing a performance benefit is
> > because the problem is not that lists are generically a more expensive
> > data structure than arrays, but that there are cases when they are
> > more expensive than arrays.  If you only ever push/pop at the front,
> > of course a list is going to be better.  If you often look up elements
> > by index, of course an array is going to be better.  If you change
> > every case where the code currently uses a list to use something else
> > instead, then you're changing both the winning and losing cases.
> 
> I don't think this argument is especially on-point, because what I'm
> actually seeing is just that there aren't any list operations that
> are expensive enough to make much of an overall difference in
> typical queries.  To the extent that an array reimplementation
> reduces the palloc traffic, it'd take some load off that subsystem,
> but apparently you need not-typical queries to really notice.
> (And, if the real motivation is aggregate palloc savings, then yes you
> really do want to replace everything...)

Could it be that allocating List* structures near the structures they
point to is enough of a benefit in terms of cache hits that it becomes a
loss when moving to a List* array?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: POC: converting Lists into arrays

2019-03-04 Thread Peter Geoghegan
On Mon, Mar 4, 2019 at 2:04 PM Bruce Momjian  wrote:
> On Mon, Mar  4, 2019 at 12:44:41PM -0500, Robert Haas wrote:
> > I think that's a separate but also promising thing to attack, and I
> > agree that it'd take a lot of work to get there.  I don't think that
> > the problem with either parse-tree-copying or list usage is that no
> > performance benefits are to be had; I think it's that the amount of
> > work required to get those benefits is pretty large.  If it were
> > otherwise, somebody likely would have done it before now.
>
> Stupid question, but do we use any kind of reference counter to know if
> two subsystems look at a structure, and a copy is required?

No, but I wonder if we could use Valgrind to enforce rules about who
has the right to scribble on what, when. That could make it a lot
easier to impose a new rule.
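
Just to illustrate the idea (a hypothetical sketch, not a proposal --
the helper names are invented, and memcheck has no write-only
protection, so reads of a "frozen" node would be flagged too):

#include "postgres.h"
#include "utils/memdebug.h"

/* Mark a palloc'd chunk off-limits; any later access trips a memcheck
 * error, which would catch a subsystem scribbling on a tree it does
 * not own.  (These are no-ops unless built with USE_VALGRIND.) */
static void
freeze_chunk(void *chunk, Size size)
{
	VALGRIND_MAKE_MEM_NOACCESS(chunk, size);
}

static void
thaw_chunk(void *chunk, Size size)
{
	/* contents were valid before freezing, so mark them defined again */
	VALGRIND_MAKE_MEM_DEFINED(chunk, size);
}

The hard part would be knowing the extent of every node in a tree, so
this would presumably have to hook into the memory context machinery
rather than work chunk-by-chunk like this.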

-- 
Peter Geoghegan



Re: patch to allow disable of WAL recycling

2019-03-04 Thread Jerry Jelinek
Alvaro,

Thanks for taking a look at the new patch. I'll update the patch to change
the name of the tunable to match your suggestion and I'll also go through
the cleanup you suggested. Finally, I'll try to rewrite the doc to
eliminate the confusion around the wording about allocating new blocks on
every write. I'll send out a new draft of the patch once all of these
changes are done.

Thanks again,
Jerry


On Wed, Feb 27, 2019 at 4:12 PM Alvaro Herrera 
wrote:

> On 2019-Feb-05, Jerry Jelinek wrote:
>
> > First, since last fall, we have found another performance problem related
> > to initializing WAL files. I've described this issue in more detail
> below,
> > but in order to handle this new problem, I decided to generalize the
> patch
> > so the tunable refers to running on a Copy-On-Write filesystem instead of
> > just being specific to WAL recycling. Specifically, I renamed the GUC
> > tunable from 'wal_recycle' to 'wal_cow_fs'. Hopefully this will make it
> > more obvious what is being tuned and will also be more flexible if there
> > are other problems in the future which are related to running on a COW
> > filesystem. I'm happy to choose a different name for the tunable if
> people
> > don't like 'wal_cow_fs'.
>
> I think the idea of it being a generic tunable for assorted behavior
> changes, rather than specific to WAL recycling, is a good one.  I'm
> unsure about your proposed name -- maybe "wal_cow_filesystem" is better?
>
> I'm rewording your doc addition a little bit.  Here's my proposal:
>
>
> This parameter should only be set to on when the WAL
> resides on a Copy-On-Write (COW) filesystem.
> Enabling this option adjusts behavior to take advantage of the
> filesystem characteristics (for example, recycling WAL files and
> zero-filling new WAL files are disabled).
>
> This part sounds good enough to me -- further suggestions welcome.
>
> I'm less sure about this phrase:
>
> This setting is only appropriate for filesystems which
> allocate new disk blocks on every write.
>
> Is "... which allocate new disk blocks on every write" a technique
> distinct from CoW itself?  I'm confused as to what it means, or how can
> the user tell whether they are on such a filesystem.
>
> Obviously you're thinking that ZFS is such a filesystem and everybody
> who has pg_wal on ZFS should enable this option.  What about, say, Btrfs
> -- should they turn this option on?  Browsing the wikipedia, I find that
> Windows has this ReFS thing that apparently is also CoW, but NTFS isn't.
> I don't think either Btrfs or ReFS are realistic options to put pg_wal
> on, so let's just list the common filesystems for which users are
> supposed to enable this option ... which I think nowadays is just ZFS.
> All in all, I would replace this phrase with something like: "This
> setting should be enabled when pg_wal resides on a ZFS filesystem or
> similar." That should be weasely enough that it's clear that we expect
> users to do the homework when on unusual systems, while actively pointing
> out the most common use case.
>
> > Finally, the patch now includes bypassing the zero-fill for new WAL files
> > when wal_cow_fs is true.
>
> That makes sense.  I think all these benchmarks Tomas Vondra ran are not
> valid anymore ...
>
> The attached v2 has assorted cosmetic cleanups.  If you can validate it,
> I would appreciate it.
>
> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: POC: converting Lists into arrays

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-04 16:28:40 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > I don't buy this. I think e.g. redesigning the way we represent
> > targetlists would be good (it's e.g. insane that we recompute
> > descriptors out of them all the time), and would reduce their allocator
> > costs.
> 
> Maybe we're not on the same page here, but it seems to me that that'd be
> addressable with pretty localized changes (eg, adding more fields to
> TargetEntry, or keeping a pre-instantiated output tupdesc in each Plan
> node).  But if the concern is about the amount of palloc bandwidth going
> into List cells, we're not going to be able to improve that with localized
> data structure changes; it'll take something like the patch I've proposed.

What I'm saying is that it'd be reasonable to replace the use of list
for targetlists with 'list2' without a wholesale replacement of all the
list code, and it'd give us benefits.


> I find it interesting that you get different results.

What I reported weren't vanilla pgbench -S results, so there's that
difference. If you measure the DO loop based test I posted, do you see a
difference?

Greetings,

Andres Freund



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-03-04 Thread Bossart, Nathan
On 3/4/19, 2:05 PM, "Andres Freund"  wrote:
> On 2019-03-04 22:00:47 +, Bossart, Nathan wrote:
>> On 3/4/19, 1:44 PM, "Andres Freund"  wrote:
>> > Yea, I do think that's a danger. But we allow disabling autovacuum, so
>> > I'm not sure it matters that much... And for indexes you'd still have
>> > the index page-level vacuum that'd continue to work.
>> 
>> I think the difference here is that there isn't something like
>> autovacuum_freeze_max_age to force index cleanup at some point.
>> Granted, you can set autovacuum_freeze_max_age to 2B if you want, but
>> at least there's a fallback available.
>
> Well, but your cluster doesn't suddenly shut down because of index bloat
> (in contrast to xid wraparound). So I don't quite see an equivalent need
> for an emergency valve.  I think we should just put a warning into the
> reloption's docs, and leave it at that.

That seems reasonable to me.

Nathan



Re: Rare SSL failures on eelpout

2019-03-04 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Mar 5, 2019 at 10:08 AM Tom Lane  wrote:
>> It's all very confusing, but I think there's a nontrivial chance
>> that this is an OpenSSL bug, especially since we haven't been able
>> to replicate it elsewhere.

> Hmm.  Yes, it is strange that we haven't seen it elsewhere, but
> remember that very few animals are running the ssl tests; also it
> requires particular timing to hit.

True.  I've spent some time today running the ssl tests on various
machines here, without any luck reproducing.

> OK, here's something.  I can reproduce it quite easily on this
> machine, and I can fix it like this:

>libpq_gettext("could not send startup packet: %s\n"),
>SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
> free(startpacket);
> +   pqHandleSendFailure(conn);
> goto error_return;
> }

Yeah.  But I don't like pqHandleSendFailure all that much: it has strong
constraints on what state libpq has to be in, as a consequence of which
it's called from a bunch of ad-hoc places, and can't be called from
some others.  It's kind of accidental that it will work here.

I was toying with the idea of getting rid of that function in
favor of a design like this:

* Handle any socket-write failure at some low level of libpq by
recording the fact that the error occurred (plus whatever data we
need to produce a message about it), and then starting to just
discard output data.

* Eventually, we'll try to read.  Process any available input data
as usual (and, if we get a read error, report that).  When no more
input data is available, if a socket write failure has been recorded,
report that much as if it were an incoming ERROR message, and then
shut down the connection.
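
In sketch form, the write side might look about like this (the
write_failed/write_err_msg fields and the exact call site are invented
for illustration, this is not a worked-out patch):

static int
pqsecure_raw_write_sketch(PGconn *conn, const void *ptr, size_t len)
{
	ssize_t		n = send(conn->sock, ptr, len, 0);

	if (n < 0 && SOCK_ERRNO != EINTR && SOCK_ERRNO != EWOULDBLOCK)
	{
		char		sebuf[256];

		if (!conn->write_failed)
		{
			/* remember the first failure, but don't report it yet */
			conn->write_failed = true;
			conn->write_err_msg = strdup(SOCK_STRERROR(SOCK_ERRNO,
													   sebuf, sizeof(sebuf)));
		}
		/* pretend the bytes went out, so callers proceed to read */
		return (int) len;
	}
	return (int) n;
}

with the read side emitting conn->write_err_msg as if it had arrived as
an ERROR message once pqReadData() finds nothing more to consume.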

This would essentially give us pqHandleSendFailure-like behavior
across the board, which might be enough to fix this problem as well as
bug #15598.  Or not ... as you say, we haven't thoroughly understood
either issue, so it's possible this wouldn't do it.

This would have the effect of prioritizing reports of socket read
failures over those of write failures, which is an interesting
behavioral change, but offhand I can't see a problem with it.

One thing that isn't real clear to me is how much timing sensitivity
there is in "when no more input data is available".  Can we assume that
if we've gotten ECONNRESET or an allied error from a write, then any
data the far end might've wished to send us is already available to
read?  In principle, since TCP allows shutting down either transmission
direction independently, the server could close the read side of its
socket while continuing to transmit data.  In practice, PG servers
don't do that; but could network timing glitches create the same effect?
Even if it's possible, does it happen enough to worry about?

The reason I'm concerned is that I don't think it'd be bright to ignore a
send error until we see input EOF, which'd be the obvious way to solve a
timing problem if there is one.  If our send error was some transient
glitch and the connection is really still open, then we might not get EOF,
but we won't get a server reply either because no message went to the
server.  You could imagine waiting some small interval of time before
deciding that it's time to report the write failure, but ugh.

In any case, consuming immediately-available data before reporting the
write error would already be a step forward over what we're doing now,
I think.

Thoughts?

regards, tom lane



Patch to document base64 encoding

2019-03-04 Thread Karl O. Pinc
Hi,

Doc patch, against master.  Documents encode() and decode() base64
format.

Builds for me.

Attached: doc_base64_v1.patch

References RFC2045 section 6.8 to define base64.

Because encode() and decode() show up in both the string
functions section and the binary string functions section,
I documented it only in the string functions section and hyperlinked
"base64" in both sections to the new text.


Note that XML output can also generate base64 data.  I suspect
this is done via the (different, src/common/base64.c)
pg_b64_encode() function which does not limit line length.  
In any case this patch does not touch the XML documentation.

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6765b0d584..f6b4b2c5a0 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1752,6 +1752,9 @@
 
  decode
 
+
+ base64
+
 decode(string text,
 format text)

@@ -1769,13 +1772,16 @@
 
  encode
 
+
+ base64
+
 encode(data bytea,
 format text)

text

 Encode binary data into a textual representation.  Supported
-formats are: base64, hex, escape.
+formats are: base64, hex, escape.
 escape converts zero bytes and high-bit-set bytes to
 octal sequences (\nnn) and
 doubles backslashes.
@@ -2365,6 +2371,20 @@
 format treats a NULL as a zero-element array.

 
+   
+ base64
+   
+
+   
+ The base64 encoding of the encode
+ and decode functions is that of RFC2045 section 6.8.
+ As per the RFC, encoded lines are broken at 76 characters.  However
+ instead of the MIME CRLF end-of-line marker, only a newline is used for
+ end-of-line.  The carriage-return, newline, space, and tab characters are
+ ignored by decode.  Otherwise, an error is raised
+ when decode is supplied invalid base64 data.
+   
+

See also the aggregate function string_agg in
.
@@ -3577,13 +3597,16 @@ SELECT format('Testing %3$s, %2$s, %s', 'one', 'two', 'three');
 
  encode
 
+
+ base64
+
encode(data bytea,
format text)
   
   text
   
Encode binary data into a textual representation.  Supported
-   formats are: base64, hex, escape.
+   formats are: base64, hex, escape.
escape converts zero bytes and high-bit-set bytes to
octal sequences (\nnn) and
doubles backslashes.


Re: Remove Deprecated Exclusive Backup Mode

2019-03-04 Thread Bruce Momjian
On Thu, Feb 28, 2019 at 05:08:24PM +0200, David Steele wrote:
> On 2/28/19 4:44 PM, Fujii Masao wrote:
> > On Wed, Feb 27, 2019 at 4:35 PM Laurenz Albe  
> > wrote:
> > > 
> > > Fujii Masao wrote:
> > > > So, let me clarify the situations;
> > > > 
> > > > (3) If backup_label.pending exists but recovery.signal doesn't, the 
> > > > server
> > > > ignores (or removes) backup_label.pending and does the recovery
> > > > starting from pg_control's REDO location. This case can happen if
> > > > the server crashes while an exclusive backup is in progress.
> > > > So crash-in-the-middle-of-backup doesn't prevent the recovery 
> > > > from
> > > > starting in this idea
> > > 
> > > That's where I see the problem with your idea.
> > > 
> > > It is fairly easy for someone to restore a backup and then just start
> > > the server without first creating "recovery.signal", and that would
> > > lead to data corruption.
> > 
> > Isn't this case problematic even when a backup was taken by pg_basebackup?
> > Because of the lack of recovery.signal, no archived WAL files are replayed and
> > the database may not reach the latest point.
> 
> It is problematic because we have a message encouraging users to delete
> backup_label when in fact they should be creating recovery.signal.

Should we issue a client NOTICE/WARNING message as part of
pg_start_backup to indicate what to do if the server crashes before
pg_stop_backup()?
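
Something along these lines, perhaps (the wording is invented here,
just to illustrate the idea):

	ereport(NOTICE,
			(errmsg("exclusive backup started"),
			 errhint("If the server crashes before pg_stop_backup() is called, "
					 "remove backup_label before restarting this server; when "
					 "restoring from the backup itself, keep backup_label and "
					 "create recovery.signal instead.")));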

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: pg_partition_tree crashes for a non-defined relation

2019-03-04 Thread Michael Paquier
On Mon, Mar 04, 2019 at 03:56:00PM -0300, Alvaro Herrera wrote:
> I added tests to immortalize the current behavior for legacy inheritance
> relations too.  Thanks!

Makes sense, thanks.
--
Michael


signature.asc
Description: PGP signature


Re: Allowing extensions to supply operator-/function-specific info

2019-03-04 Thread Tom Lane
Paul Ramsey  writes:
> Gotcha, done and now have an implementation that passes all our regression 
> tests.

Very cool!  So the next step, I guess, is to address your original problem
by cranking up the cost estimates for these functions --- have you tried
that yet?  In principle you should be able to do that and not have any
bad planning side-effects, but this is all pretty new territory so maybe
some problems remain to be ironed out.

BTW, if you'd like me to review the code you added for this, I'd be happy
to do so.  I've never looked at PostGIS' innards, but probably I can make
sense of the code for this despite that.

regards, tom lane



Re: Allowing extensions to supply operator-/function-specific info

2019-03-04 Thread Paul Ramsey


> On Mar 4, 2019, at 2:52 PM, Tom Lane  wrote:
> 
> Paul Ramsey  writes:
>> Gotcha, done and now have an implementation that passes all our regression 
>> tests.
> 
> Very cool!  So the next step, I guess, is to address your original problem
> by cranking up the cost estimates for these functions --- have you tried
> that yet?  In principle you should be able to do that and not have any
> bad planning side-effects, but this is all pretty new territory so maybe
> some problems remain to be ironed out.
> 
> BTW, if you'd like me to review the code you added for this, I'd be happy
> to do so.  I've never looked at PostGIS' innards, but probably I can make
> sense of the code for this despite that.

I would be ecstatic for a review; I’m sure I’ve left a million loose threads
dangling.

P.

https://github.com/pramsey/postgis/blob/svn-trunk-supportfn/postgis/gserialized_supportfn.c#L191
 


https://github.com/pramsey/postgis/blob/svn-trunk-supportfn/postgis/postgis.sql.in#L4290
 





Re: Rare SSL failures on eelpout

2019-03-04 Thread Thomas Munro
On Tue, Mar 5, 2019 at 11:35 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > OK, here's something.  I can reproduce it quite easily on this
> > machine, and I can fix it like this:
>
> >libpq_gettext("could not send startup packet: %s\n"),
> >SOCK_STRERROR(SOCK_ERRNO, sebuf, sizeof(sebuf)));
> > free(startpacket);
> > +   pqHandleSendFailure(conn);
> > goto error_return;
> > }
>
> Yeah.  But I don't like pqHandleSendFailure all that much: it has strong
> constraints on what state libpq has to be in, as a consequence of which
> it's called from a bunch of ad-hoc places, and can't be called from
> some others.  It's kind of accidental that it will work here.
>
> I was toying with the idea of getting rid of that function in
> favor of a design like this:
>
> * Handle any socket-write failure at some low level of libpq by
> recording the fact that the error occurred (plus whatever data we
> need to produce a message about it), and then starting to just
> discard output data.
>
> * Eventually, we'll try to read.  Process any available input data
> as usual (and, if we get a read error, report that).  When no more
> input data is available, if a socket write failure has been recorded,
> report that much as if it were an incoming ERROR message, and then
> shut down the connection.
>
> This would essentially give us pqHandleSendFailure-like behavior
> across the board, which might be enough to fix this problem as well as
> bug #15598.  Or not ... as you say, we haven't thoroughly understood
> either issue, so it's possible this wouldn't do it.
>
> This would have the effect of prioritizing reports of socket read
> failures over those of write failures, which is an interesting
> behavioral change, but offhand I can't see a problem with it.

That seems to recreate (and extend) the behaviour that we see on
other TCP stacks, or on Linux with a remote connection, namely that
the first sendto() succeeds (even though future sendto() calls may
fail, we don't usually get to that because we read and discover an
application-level error message or whatever).

> One thing that isn't real clear to me is how much timing sensitivity
> there is in "when no more input data is available".  Can we assume that
> if we've gotten ECONNRESET or an allied error from a write, then any
> data the far end might've wished to send us is already available to
> read?  In principle, since TCP allows shutting down either transmission
> direction independently, the server could close the read side of its
> socket while continuing to transmit data.  In practice, PG servers
> don't do that; but could network timing glitches create the same effect?
> Even if it's possible, does it happen enough to worry about?

That is beyond my knowledge of networking; I have CC'd Andrew Gierth
in case he has something to say about that.  It certainly seems
sensible to assume that eg RST must follow any data that the other end
sent before closing, if it did indeed call close(), and in our case we
know that it never calls shutdown(SHUT_RD), so the only other
possibility seems to be that it crashed or the connection was lost.
So it seems safe to assume that we can read the server's parting words
after we've had an EPIPE/ECONNRESET error.  Based on some quick
googling, it looks like SHUT_RD doesn't actually send anything anyway
(unlike SHUT_WR which sends FIN and would result in EOF at the other
end), so I'm not sure if SHUT_RD even "exists" outside the mind of the
TCP stack doing it and therefore I'm not sure if there is any "timing
glitch" that could resemble it.  But I don't know.

> The reason I'm concerned is that I don't think it'd be bright to ignore a
> send error until we see input EOF, which'd be the obvious way to solve a
> timing problem if there is one.  If our send error was some transient
> glitch and the connection is really still open, then we might not get EOF,
> but we won't get a server reply either because no message went to the
> server.  You could imagine waiting some small interval of time before
> deciding that it's time to report the write failure, but ugh.
>
> In any case, consuming immediately-available data before reporting the
> write error would already be a step forward over what we're doing now,
> I think.
>
> Thoughts?

It seems reasonable to me on the grounds that sending data doesn't
mean the other guy got it anyway and we can see that on other TCP
stacks.  Preferring errors resulting from reads seems sensible because
it's exactly what we want in these cases, and I can't immediately
think of a downside.  It's bigger surgery than I was thinking of but
it seems like it removes some pre-existing complicated code and
replaces it with something simple, so that seems to be a win.

-- 
Thomas Munro
https://enterprisedb.com



Re: New vacuum option to do only freezing

2019-03-04 Thread Bossart, Nathan
On 2/28/19, 12:13 AM, "Masahiko Sawada"  wrote:
> Attached is the updated version of the patch. I've incorporated all review
> comments I got and have changed the number of tuples being reported as
> 'removed tuples'. With this option, the only tuples that are completely
> removed are those marked as unused during HOT pruning; other dead tuples are
> left. So we count those tuples during HOT pruning and report them as
> removed tuples.

Thanks for the new patches.  Beyond the reloptions discussion, most of
my feedback is wording suggestions.

+  VACUUM removes dead tuples and prunes HOT-updated
+  tuples chain for live tuples on table. If the table has any dead tuple
+  it removes them from both table and indexes for re-use. With this
+  option VACUUM doesn't completely remove dead tuples
+  and disables removing dead tuples from indexes.  This is suitable for
+  avoiding transaction ID wraparound (see
+  ) but not sufficient for avoiding
+  index bloat. This option is ignored if the table doesn't have index.
+  This cannot be used in conjunction with FULL
+  option.

There are a couple of small changes I would make.  How does something
like this sound?

VACUUM removes dead tuples and prunes HOT-updated tuple chains for
live tuples on the table.  If the table has any dead tuples, it
removes them from both the table and its indexes and marks the
corresponding line pointers as available for re-use.  With this
option, VACUUM still removes dead tuples from the table, but it
does not process any indexes, and the line pointers are marked as
dead instead of available for re-use.  This is suitable for
avoiding transaction ID wraparound (see Section 24.1.5) but not
sufficient for avoiding index bloat.  This option is ignored if
the table does not have any indexes.  This cannot be used in
conjunction with the FULL option.

- * Returns the number of tuples deleted from the page and sets
- * latestRemovedXid.
+ * Returns the number of tuples deleted from the page and set latestRemoveXid
+ * and increment nunused.

I would say something like: "Returns the number of tuples deleted from
the page, sets latestRemovedXid, and updates nunused."

+   /*
+* hasindex = true means two-pass strategy; false means one-pass. But we
+* always use the one-pass strategy when index vacuum is disabled.
+*/

I think the added sentence should make it more clear that hasindex
will still be true when DISABLE_INDEX_CLEANUP is used.  Maybe
something like:

/*
 * hasindex = true means two-pass strategy; false means one-pass
 *
 * If DISABLE_INDEX_CLEANUP is used, hasindex may still be true,
 * but we'll always use the one-pass strategy.
 */

tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, false,
-   
 &vacrelstats->latestRemovedXid);
+   
 &vacrelstats->latestRemovedXid,
+   
 &tups_pruned);

Why do we need a separate tups_pruned argument in heap_page_prune()?
Could we add the result of heap_page_prune() to tups_pruned instead,
then report the total number of removed tuples as tups_vacuumed +
tups_pruned elsewhere?

+* If there are no indexes or we skip index vacuum then we can 
vacuum
+* the page right now instead of doing a second scan.

How about:

If there are no indexes or index cleanup is disabled, we can
vacuum the page right now instead of doing a second scan.

+   /*
+* Here, we have indexes but index vacuum is 
disabled. We don't
+* vacuum dead tuples on heap but forget them 
as we skip index
+* vacuum. The vacrelstats->dead_tuples could 
have tuples which
+* became dead after checked at HOT-pruning 
time but aren't marked
+* as dead yet. We don't process them because 
it's a very rare
+* condition and the next vacuum will process 
them.
+*/

I would suggest a few small changes:

/*
 * Here, we have indexes but index vacuum is disabled.  Instead of
 * vacuuming the dead tuples on the heap, we just forget them.
 *
 * Note that vacrelstats->dead_tuples could include tuples which
 * became dead after HOT-pruning but are not marked dead yet.  We
 * do not process them because this is a very rare condition, and
 * the next vacuum will process them anyway.
 */

-   /* If no indexes, make log report that lazy_vacuum_heap would've made */
+   /*
+* If no index or disables index vacuum, make log report that lazy_vacuum_heap
+* would

Re: jsonpath

2019-03-04 Thread Tomas Vondra
A bunch of additional comments, after looking at the patch a bit today.
All are mostly minor, and sometime perhaps a matter of preference.


1) There's a mismatch between the comment and actual function name for
jsonb_path_match_opr and jsonb_path_exists_opr(). The comments say
"_novars" instead.

2) In a couple of switches the "default" case does a return with a
value after the elog(ERROR), so it's practically unreachable. AFAICS
it's fine without it, and we don't do this elsewhere. I don't get any
compiler warnings if I remove it, either.

Examples:

JsonbTypeName

default:
elog(ERROR, "unrecognized jsonb value type: %d", jbv->type);
return "unknown";

jspOperationName

default:
elog(ERROR, "unrecognized jsonpath item type: %d", type);
return NULL;

compareItems

default:
elog(ERROR, "unrecognized jsonpath operation: %d", op);
return jpbUnknown;

3) jsonpath_send is using makeStringInfo() for a value that is not
returned - IMHO it should use a regular stack-allocated variable and
initStringInfo() instead
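
Something like this is what I have in mind (untested sketch; the
variable name is only for illustration):

    StringInfoData jtext;       /* plain local, no palloc'd header */

    initStringInfo(&jtext);     /* only the data buffer is palloc'd */
    /* ... build the value in jtext, then use jtext.data / jtext.len ... */
    pfree(jtext.data);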

4) the version number should be defined/used as a constant, not as a
magic constant somewhere in the code

5) Why does jsonPathToCstring do this?

appendBinaryStringInfo(out, "strict ", 7);

Why not use the regular appendStringInfoString()? What am I missing?
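
(For the record, the two should be equivalent, since
appendStringInfoString() just does the strlen() itself:

    appendStringInfoString(out, "strict ");

and it avoids the hard-coded length.)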

6) comment typo: "Aling StringInfo"

7) alignStringInfoInt() should explain why we need this and why INTALIGN
is the right alignment.

8) I'm a bit puzzled by what flattenJsonPathParseItem does with the
'next' value here:

/*
 * Actual value will be recorded later, after next and children
 * processing.
 */
appendBinaryStringInfo(buf,
   (char *) &next, /* fake value */
   sizeof(next));

Perhaps a comment explaining it (why we need a fake value at all?) would
be a good idea here.

9) I see printJsonPathItem is only calling check_stack_depth() while
flattenJsonPathParseItem also calls CHECK_FOR_INTERRUPTS(). Why the
difference, considering they seem about equally expensive?

10) executeNumericItemMethod is missing a comment (unlike the other
executeXXX functions)

11) Wording of some of the error messages in the execute methods seems a
bit odd. For example executeNumericItemMethod may complain that it

... is applied to not a numeric value

but perhaps a more natural wording would be

... is applied to a non-numeric value

And similarly for the other execute methods. But I'm not a native
speaker, so perhaps the original wording is just fine.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: converting Lists into arrays

2019-03-04 Thread David Rowley
On Tue, 5 Mar 2019 at 11:11, Andres Freund  wrote:
> On 2019-03-04 16:28:40 -0500, Tom Lane wrote:
> > Andres Freund  writes:
> > > I don't buy this. I think e.g. redisgning the way we represent
> > > targetlists would be good (it's e.g. insane that we recompute
> > > descriptors out of them all the time), and would reduce their allocator
> > > costs.
> >
> > Maybe we're not on the same page here, but it seems to me that that'd be
> > addressable with pretty localized changes (eg, adding more fields to
> > TargetEntry, or keeping a pre-instantiated output tupdesc in each Plan
> > node).  But if the concern is about the amount of palloc bandwidth going
> > into List cells, we're not going to be able to improve that with localized
> > data structure changes; it'll take something like the patch I've proposed.
>
> What I'm saying is that it'd be reasonable to replace the use of list
> for targetlists with 'list2' without a wholesale replacement of all the
> list code, and it'd give us benefits.

So you think targetlists are the only case to benefit from an array
based list? (Ignoring the fact that I already showed another case)
When we discover the next thing to benefit, then the replacement will
be piecemeal, just the way Tom would rather not do it.  I personally
don't want to be up against huge resistance when I discover that
turning a single usage of a List into List2 is better.  We'll need to
consider backpatching pain / API breakage *every single time*.

A while ago I did have a go at changing some List implementations for
my then proposed ArrayList and it was beyond a nightmare, as each time
I changed one I realised I needed to change another. In the end, I
just gave up. Think of all the places we have forboth() and
forthree(), we'll need to either provide a set of macros that take
various combinations of List and List2 or do some conversion
beforehand.  With respect, if you don't believe me, please take my
ArrayList patch [1] and have a go at changing targetlists to use
ArrayLists all the way from the parser through to the executor. I'll
be interested in the diff stat once you're done.

It's true that linked lists are certainly better for some stuff;
list_concat() is going to get slower, lcons() too, but likely we can
have a bonus lcons() elimination round at some point. I see quite a
few of them that look like they could be changed to lappend().  I also
just feel that if we insist on more here then we'll get about nothing.
I'm also blocked on my partition performance improvement goals on
list_nth() being O(N), so I'm keen to see progress here and do what I
can to help with that.  With list_concat() I find that pretty scary
anyway. Using it means we can have a valid list that does not get its
length updated when someone appends a new item. Most users of that do
list_copy() to sidestep that and other issues... which likely is
something we'd want to rip out with Tom's patch.
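
(For context on the list_nth() point: with the current cell-chain
representation a lookup has to walk the cells one by one, roughly like
the sketch below, whereas with an array-based list it would be a single
index into the element array.)

    /* roughly what today's list_nth() boils down to (error checks omitted) */
    static void *
    list_nth_slow(const List *list, int n)
    {
        ListCell   *cell = list->head;

        while (n-- > 0)
            cell = cell->next;  /* one pointer hop per skipped element */
        return lfirst(cell);
    }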

[1] 
https://www.postgresql.org/message-id/CAKJS1f_2SnXhPVa6eWjzy2O9A=ocwgd0cj-lqewpgtrwqbu...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Inheriting table AMs for partitioned tables

2019-03-04 Thread Andres Freund
Hi,

In the pluggable storage patch [1], one thing that I'm wondering about
is how exactly to inherit the storage AM across partitions. I think
that's potentially worthy of a discussion with a wider audience than I'd
get in that thread.  It seems also related to the recent discussion in [2]

Consider (excerpted from the tests):

CREATE TABLE tableam_parted_heap2 (a text, b int) PARTITION BY list (a) USING 
heap2;

SET default_table_access_method = 'heap';
CREATE TABLE tableam_parted_a_heap2 PARTITION OF tableam_parted_heap2 FOR 
VALUES IN ('a');

SET default_table_access_method = 'heap2';
CREATE TABLE tableam_parted_b_heap2 PARTITION OF tableam_parted_heap2 FOR 
VALUES IN ('b');

CREATE TABLE tableam_parted_c_heap2 PARTITION OF tableam_parted_heap2 FOR 
VALUES IN ('c') USING heap;
CREATE TABLE tableam_parted_d_heap2 PARTITION OF tableam_parted_heap2 FOR 
VALUES IN ('d') USING heap2;

It seems pretty clear that tableam_parted_heap2, tableam_parted_d_heap2
would be stored via heap2, and tableam_parted_c_heap2 via heap.

But for tableam_parted_a_heap2 tableam_parted_b_heap2 the answer isn't
quite as clear.  I think it'd both be sensible for new partitions to
inherit the AM from the root, but it'd also be sensible to use the
current default.

Out of laziness (it's how it works rn) I'm inclined to go with using
the current default, but I'd be curious if others disagree.

Greetings,

Andres Freund

[1] https://postgr.es/m/20180703070645.wchpu5muyto5n647%40alap3.anarazel.de
[2] 
https://www.postgresql.org/message-id/201902041630.gpadougzab7v%40alvherre.pgsql



Re: bgwriter_lru_maxpages limits in PG 10 sample conf

2019-03-04 Thread Bruce Momjian
On Thu, Feb 28, 2019 at 10:28:44AM +0300, Sergei Kornilov wrote:
> Hello
> 
> postgresql.conf.sample was changed recently in REL_10_STABLE (commit 
> ab1d9f066aee4f9b81abde6136771debe0191ae8)
> So config will be changed in next minor release anyway. We have another 
> reason to not fix bgwriter_lru_maxpages comment?

Frankly, I was surprised postgresql.conf.sample was changed in a back
branch since it will cause people who diff $PGDATA/postgresql.conf with
share/postgresql.conf.sample to see differences they didn't make.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: POC: converting Lists into arrays

2019-03-04 Thread Andres Freund
Hi,

On 2019-03-05 12:42:47 +1300, David Rowley wrote:
> So you think targetlists are the only case to benefit from an array
> based list? (Ignoring the fact that I already showed another case)

No, that's not what I'm trying to say at all. I think there are plenty of
cases where it'd be beneficial. In this subthread we're just arguing
whether it's somewhat feasible to not change everything, and I'm still
fairly convinced that's possible; but I'm not arguing that that's the
best way.


> It's true that linked lists are certainly better for some stuff;
> list_concat() is going to get slower, lcons() too, but likely we can
> have a bonus lcons() elimination round at some point. I see quite a
> few of them that look like they could be changed to lappend().  I also
> just feel that if we insist on more here then we'll get about nothing.
> I'm also blocked on my partition performance improvement goals on
> list_nth() being O(N), so I'm keen to see progress here and do what I
> can to help with that.  With list_concat() I find that pretty scary
> anyway. Using it means we can have a valid list that does not get its
> length updated when someone appends a new item. Most users of that do
> list_copy() to sidestep that and other issues... which likely is
> something we'd want to rip out with Tom's patch.

Yes, I think you have a point that progress here would be good and that
it's worth some pain. But the names will make even less sense if we just
shunt in an array based approach under the already obscure list
API. Obviously the individual pain of that is fairly small, but over the
years and everybody reading PG code, it's also substantial.  So I'm
torn.

Greetings,

Andres Freund



RE: pgbench - doCustom cleanup

2019-03-04 Thread Jamison, Kirk
Hi Fabien, 

>> See the attached latest patch.
>> The attached patch applies, builds cleanly, and passes the regression 
>> tests.
>
> No problems on my part as I find the changes logical.
> This also needs a confirmation from Alvaro.
>
> Ok.
>
> You switched the patch to "waiting on author": What are you waiting from me?
>
> If you think that it is ok and that it should be considered by a committer,
> probably Alvaro, it can be marked as "ready".

Indeed, it was my mistake.
I have already marked it as "Ready for Committer".

Regards,
Kirk Jamison




Re: pg_dump multi VALUES INSERT

2019-03-04 Thread David Rowley
Thanks for looking at this again.

On Sat, 2 Mar 2019 at 20:01, Fabien COELHO  wrote:
> Although I'm all in favor of checking the int associated to the option, I
> do not think that it warrants three checks and messages. I would suggest
> to factor them all as just one check and one (terse) message.

Yeah. I've been trying to keep that area sane for a while, so I agree
that one message is fine. Done that way in the attached and put back
the missing ERANGE check.

> Option "--help" line: number of row*s* ?

Oops. Fixed.

> About the output: I'd suggest to indent one line per row, something like:
>
>INSERT INTO foo VALUES
>  (..., ..., ..., ...),
>  (..., ..., ..., ...),
>  (..., ..., ..., ...);

Reasonable. Changed it to that. I put a tab at the start of those
lines.  There's still the possibility that only 1 final row makes up
the final INSERT.  Such INSERTs will still span multiple lines. I don't think
there's anything that can reasonably be done about that.

> I'd suggest to add test tables with (1) no rows and (2) no columns but a
> few rows, with multiple --table options.

I didn't do that, partly because I think you're asking for tests of
existing behaviour, and partly because perl gives me a sore head.
Maybe Surafel want to do that?

v17 attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pg_dump-rows-per-insert-option-v17.patch
Description: Binary data


GSoC 2019 - TOAST'ing in slices idea

2019-03-04 Thread Bruno Hass

Hello everyone,

I am currently writing my proposal for GSoC 2019 for the TOAST'ing in slices 
idea.
 I already have a sketch of the description and approach outline, which I am 
sending in this e-mail. I would be happy to receive some feedback on it. I've 
been reading the PostgreSQL source code and documentation in order to 
understand better how to implement this idea and to write a detailed proposal. 
I would also be glad to receive recommendations of documentation or
material on TOAST'ing internals, as well as hints on where to look further
in the source code.


I am looking forward to your feedback!


PostgreSQL Proposal.pdf
Description: PostgreSQL Proposal.pdf


Re: Should we increase the default vacuum_cost_limit?

2019-03-04 Thread Peter Geoghegan
On Mon, Feb 25, 2019 at 8:48 AM Robert Haas  wrote:
> +1 for raising the default substantially.  In my experience, and it
> seems others are in a similar place, nobody ever gets into trouble
> because the default is too high, but sometimes people get in trouble
> because the default is too low.

Does anyone want to make an argument against the idea of raising the
default? They should speak up now.

-- 
Peter Geoghegan



Re: POC: converting Lists into arrays

2019-03-04 Thread David Rowley
On Tue, 5 Mar 2019 at 12:54, Andres Freund  wrote:
> Yes, I think you have a point that progress here would be good and that
> it's worth some pain. But the names will make even less sense if we just
> shunt in an array based approach under the already obscure list
> API.

If we feel strongly about fixing that then probably it would be as
simple as renaming the functions and adding some macros with the old
names and insisting that all new or changed code use the functions and
not the macro wrappers. That could be followed up by a final sweep in
N years time when the numbers have dwindled to a low enough level. All
that code mustn't be getting modified anyway, so not much chance
backpatching pain.

I see length() finally died in a similar way in Tom's patch.  Perhaps
doing this would have people consider lcons more carefully before they
use it over lappend.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Allowing extensions to supply operator-/function-specific info

2019-03-04 Thread Tom Lane
Paul Ramsey  writes:
>> On Mar 4, 2019, at 2:52 PM, Tom Lane  wrote:
>> BTW, if you'd like me to review the code you added for this, I'd be happy
>> to do so.  I've never looked at PostGIS' innards, but probably I can make
>> sense of the code for this despite that.

> I would be ecstatic for a review, I'm sure I've left a million loose threads 
> dangling.

I took a look, and saw that you'd neglected to check pseudoconstantness
of the non-index argument, so this'd fail on cases like ST_DWithin(x, y)
where x is indexed and y is another column in the same table.  Also
I thought the handling of commutation could be done better.  Attached is
a suggested patch atop your f731c1b7022381dbf627cae311c3d37791bf40c3 to
fix those and a couple of nitpicky other things.  (I haven't tested this,
mind you.)

One thing that makes me itch, but I didn't address in the attached,
is that expandFunctionOid() is looking up a function by name without
any schema-qualification.  That'll fail outright if PostGIS isn't in
the search path, and even if it is, you've got security issues there.
One way to address this is to assume that the expandfn is in the same
schema as the ST_XXX function you're attached to, so you could
do "get_namespace_name(get_func_namespace(funcid))" and then include
that in the list passed to LookupFuncName.
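
Concretely, something along these lines (untested, and I'm guessing at
the expand function's name and argument types, so treat those as
placeholders):

    /* look the expand function up in the ST_XXX function's own schema */
    char       *nspname = get_namespace_name(get_func_namespace(clause->funcid));
    List       *fname = list_make2(makeString(nspname),
                                   makeString(pstrdup("st_expand")));
    Oid         argtypes[2] = {rightdatatype, FLOAT8OID};
    Oid         expandfn_oid = LookupFuncName(fname, 2, argtypes, false);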

Also, this might be as-intended but I was wondering: I'd sort of expected
you to make, eg, _ST_DWithin() and ST_DWithin() into exact synonyms.
They aren't, since the former is not connected to the support function.
Is that intentional?  I guess if you had a situation where you wanted to
force non-use of an index, being able to use _ST_DWithin() for that would
be helpful.

regards, tom lane

diff --git a/postgis/gserialized_supportfn.c b/postgis/gserialized_supportfn.c
index 90c61f3..66b699b 100644
--- a/postgis/gserialized_supportfn.c
+++ b/postgis/gserialized_supportfn.c
@@ -60,7 +60,7 @@ Datum postgis_index_supportfn(PG_FUNCTION_ARGS);
 */
 typedef struct
 {
-	char *fn_name;
+	const char *fn_name;
 	int   strategy_number; /* Index strategy to add */
 	int   nargs;   /* Expected number of function arguments */
 	int   expand_arg;  /* Radius argument for "within distance" search */
@@ -230,22 +230,42 @@ Datum postgis_index_supportfn(PG_FUNCTION_ARGS)
 }
 
 /*
-* Somehow an indexed third argument has slipped in. That
-* should not happen.
+* We can only do something with index matches on the first or
+* second argument.
 */
 if (req->indexarg > 1)
 	PG_RETURN_POINTER((Node *)NULL);
 
 /*
-* Need the argument types (which should always be geometry or geography
-* since this function is only bound to those functions)
-* to use in the operator function lookup
+* Make sure we have enough arguments (just paranoia really).
 */
-if (nargs < 2)
+if (nargs < 2 || nargs < idxfn.expand_arg)
 	elog(ERROR, "%s: associated with function with %d arguments", __func__, nargs);
 
-leftarg = linitial(clause->args);
-rightarg = lsecond(clause->args);
+/*
+* Extract "leftarg" as the arg matching the index, and
+* "rightarg" as the other one, even if they were in the
+* opposite order in the call.  NOTE: the functions we deal
+* with here treat their first two arguments symmetrically
+* enough that we needn't distinguish the two cases beyond
+* this.  This might need more work later.
+*/
+if (req->indexarg == 0)
+{
+	leftarg = linitial(clause->args);
+	rightarg = lsecond(clause->args);
+}
+else
+{
+	rightarg = linitial(clause->args);
+	leftarg = lsecond(clause->args);
+}
+
+/*
+* Need the argument types (which should always be geometry or geography
+* since this function is only bound to those functions)
+* to use in the operator function lookup.
+*/
 leftdatatype = exprType(leftarg);
 rightdatatype = exprType(rightarg);
 
@@ -267,20 +287,27 @@ Datum postgis_index_supportfn(PG_FUNCTION_ARGS)
 */
 if (idxfn.expand_arg)
 {
-	Node *indexarg = req->indexarg ? rightarg : leftarg;
-	Node *otherarg = req->indexarg ? leftarg : rightarg;
 	Node *radiusarg = (Node *) list_nth(clause->args, idxfn.expand_arg-1);
-	// Oid indexdatatype = exprType(indexarg);
-	Oid otherdatatype = exprType(otherarg);
-	Oid expandfn_oid = expandFunctionOid(otherdatatype);
+	Oid expandfn_oid = expandFunctionOid(rightdatatype);
+
+	FuncExpr *expandexpr = makeFuncExpr(expandfn_oid, rightdatatype,
+	list_make2(rightarg, radiusarg),
+		InvalidOid, InvalidOid, COERCE_EXPLICIT_CALL);
 
-	FuncExpr *expandexpr = makeFuncExpr(expandfn_oid, otherdatatype,
-	list_make2(otherarg, radiusarg),
-		InvalidOid, req->indexcollation, COERCE_EXPLICIT_CALL);
+	/*
+	* The comparison expression has to be a pseudoconstant,
+	* ie not volatile nor dependent on the target index's
+	* tabl

Re: POC: converting Lists into arrays

2019-03-04 Thread Tom Lane
David Rowley  writes:
> ...  With list_concat() I find that pretty scary
> anyway. Using it means we can have a valid list that does not get its
> length updated when someone appends a new item. Most users of that do
> list_copy() to sidestep that and other issues... which likely is
> something we'd want to rip out with Tom's patch.

Yeah, it's a bit OT for this patch, but I'd noticed the prevalence of
locutions like list_concat(list_copy(list1), list2), and been thinking
of proposing that we add some new primitives with, er, less ad-hoc
behavior.  The patch at hand already changes the semantics of list_concat
in a somewhat saner direction, but I think there is room for a version
of list_concat that treats both its inputs as const Lists.
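
For instance, something like this (the name is made up, and it only
handles pointer lists, just to illustrate the semantics I'm after):

    /*
     * Form a new list holding all members of list1 followed by all
     * members of list2, without modifying either input.
     */
    List *
    list_concat_copy(const List *list1, const List *list2)
    {
        List       *result = list_copy(list1);
        ListCell   *lc;

        foreach(lc, list2)
            result = lappend(result, lfirst(lc));

        return result;
    }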

regards, tom lane



Re: POC: converting Lists into arrays

2019-03-04 Thread Tom Lane
David Rowley  writes:
> On Tue, 5 Mar 2019 at 12:54, Andres Freund  wrote:
>> Yes, I think you have a point that progress here would be good and that
>> it's worth some pain. But the names will make even less sense if we just
>> shunt in an array based approach under the already obscure list
>> API.

> If we feel strongly about fixing that then probably it would be as
> simple as renaming the functions and adding some macros with the old
> names and insisting that all new or changed code use the functions and
> not the macro wrappers.

Meh ... Neil Conway already did a round of that back in 2004 or whenever,
and I'm not especially excited about another round.  I'm not really
following Andres' aversion to the list API --- it's not any more obscure
than a whole lot of things in Postgres.  (Admittedly, as somebody who
dabbled in Lisp decades ago, I might be more used to it than some folks.)

regards, tom lane



Re: speeding up planning with partitions

2019-03-04 Thread Amit Langote
Hi Jesper,

Thanks for the review.  I've made all of the changes per your comments in
my local repository.  I'll post the updated patches after diagnosing what
I suspect is a memory over-allocation bug in one of the patches.  If you
configure the build with --enable-cassert, you'll see that throughput
decreases as the number of partitions runs into many thousands, but it
doesn't when asserts are turned off.

On 2019/03/05 1:20, Jesper Pedersen wrote:
> I'll run some tests using a hash partitioned setup.

Thanks.

Regards,
Amit




Re: Ordered Partitioned Table Scans

2019-03-04 Thread David Rowley
Thanks a lot for taking the time to look at this.

On Tue, 5 Mar 2019 at 03:03, Antonin Houska  wrote:
> As for the latest version (v8-0001-...) I've only caught a small typo: "When
> the first subpath needs sorted ...". It was probably meant "... needs sort
> ...".

That was a sort of short way of saying "needs [to be] sorted". I've
added in the missing "to be" in the attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v9-0001-Allow-Append-to-be-used-in-place-of-MergeAppend-f.patch
Description: Binary data


Re: psql display of foreign keys

2019-03-04 Thread Amit Langote
On 2019/03/05 4:41, Alvaro Herrera wrote:
> Here's the patch I'm really interested about :-)

Thanks for the updated patch.  I applied it and rebased the
foreign-keys-referencing-partitioned-tables patch on top.  Here's
something I think you may have missed:

-- partitioned primary key table
create table p (a int primary key) partition by list (a);
create table p1 partition of p for values in (1) partition by list (a);
create table p11 partition of p1 for values in (1);

-- regular primary key table
create table pk (a int primary key);

-- another partitioned table to define FK on
create table q (a int) partition by list (a);
create table q1 partition of q for values in (1) partition by list (a);
create table q11 partition of q1 for values in (1);

-- FK on q referencing p
alter table q add foreign key (a) references p;

-- seems OK

\d p
   Partitioned table "public.p"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 a  │ integer │   │ not null │
Partition key: LIST (a)
Indexes:
"p_pkey" PRIMARY KEY, btree (a)
Referenced by:
TABLE "q" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES p(a)
Number of partitions: 1 (Use \d+ to list them.)

\d p1
   Partitioned table "public.p1"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 a  │ integer │   │ not null │
Partition of: p FOR VALUES IN (1)
Partition key: LIST (a)
Indexes:
"p1_pkey" PRIMARY KEY, btree (a)
Referenced by:
TABLE "q" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES p(a)
Number of partitions: 1 (Use \d+ to list them.)

\d p11
Table "public.p11"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 a  │ integer │   │ not null │
Partition of: p1 FOR VALUES IN (1)
Indexes:
"p11_pkey" PRIMARY KEY, btree (a)
Referenced by:
TABLE "q" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES p(a)

-- change the FK to reference regular table
alter table q drop constraint q_a_fkey ;
alter table q add foreign key (a) references pk;

-- not OK?
\d pk
 Table "public.pk"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 a  │ integer │   │ not null │
Indexes:
"pk_pkey" PRIMARY KEY, btree (a)
Referenced by:
TABLE "q" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES pk(a)
TABLE "q1" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES pk(a)
TABLE "q11" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES pk(a)

Shouldn't the above only list the constraint on q as follows?

Referenced by:
TABLE "q" CONSTRAINT "q_a_fkey" FOREIGN KEY (a) REFERENCES pk(a)


Maybe:

@@ -2488,7 +2488,8 @@ describeOneTableDetails(const char *schemaname,
                   "SELECT conname, conrelid::pg_catalog.regclass,\n"
                   "       pg_catalog.pg_get_constraintdef(c.oid, true) as condef\n"
                   "FROM pg_catalog.pg_constraint c\n"
-                  "WHERE c.confrelid = '%s' AND c.contype = 'f' ORDER BY 1;",
+                  "WHERE c.confrelid = '%s' AND c.contype = 'f' AND conparentid = 0\n"
+                  "ORDER BY conname;",

Thanks,
Amit




Re: Inheriting table AMs for partitioned tables

2019-03-04 Thread Amit Kapila
On Tue, Mar 5, 2019 at 5:17 AM Andres Freund  wrote:
>
> Hi,
>
> In the pluggable storage patch [1], one thing that I'm wondering about
> is how exactly to inherit the storage AM across partitions. I think
> that's potentially worthy of a discussion with a wider audience than I'd
> get in that thread.  It seems also related to the recent discussion in [2]
>
> Consider (excerpted from the tests):
>
> CREATE TABLE tableam_parted_heap2 (a text, b int) PARTITION BY list (a) USING 
> heap2;
>
> SET default_table_access_method = 'heap';
> CREATE TABLE tableam_parted_a_heap2 PARTITION OF tableam_parted_heap2 FOR 
> VALUES IN ('a');
>
> SET default_table_access_method = 'heap2';
> CREATE TABLE tableam_parted_b_heap2 PARTITION OF tableam_parted_heap2 FOR 
> VALUES IN ('b');
>
> CREATE TABLE tableam_parted_c_heap2 PARTITION OF tableam_parted_heap2 FOR 
> VALUES IN ('c') USING heap;
> CREATE TABLE tableam_parted_d_heap2 PARTITION OF tableam_parted_heap2 FOR 
> VALUES IN ('d') USING heap2;
>
> It seems pretty clear that tableam_parted_heap2, tableam_parted_d_heap2
> would be stored via heap2, and tableam_parted_c_heap2 via heap.
>
> But for tableam_parted_a_heap2 tableam_parted_b_heap2 the answer isn't
> quite as clear.  I think it'd both be sensible for new partitions to
> inherit the AM from the root, but it'd also be sensible to use the
> current default.
>

Yeah, we can go either way.

> Out of laziness (it's how it works rn) I'm inclined to go with using
> the current default, but I'd be curious if others disagree.
>

I think using the current default should be okay as that will be the
behavior for non-partitioned tables as well.  However, if people have
good reasons to go other way, then that is fine too.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


