Re: COPY FROM WHEN condition

2019-04-02 Thread Andres Freund
On 2019-04-03 18:44:32 +1300, David Rowley wrote:
> (Fixed all of what you mentioned)
> 
> On Wed, 3 Apr 2019 at 06:57, Andres Freund  wrote:
> > > +/*
> > > + * CopyMultiInsertInfo_Flush
> > > + *   Write out all stored tuples in all buffers out to the tables.
> > > + *
> > > + * To save us from ending up with buffers for 1000s of partitions we remove
> > > + * buffers belonging to partitions that we've seen no tuples for in this batch
> >
> > That seems a little naive (imagine you have like 5 partitions, and we
> > constantly cycle through 2-3 of them per batch).  It's probably OK for
> > this version.   I guess I'd only do that cleanup if we're actually
> > flushing due to the number of partitions.
> 
> Hmm, good point.  It seems like being smarter there would be a good thing.
> 
> I've ended up getting rid of the hash table in favour of the List that
> you mentioned and storing the buffer in ResultRelInfo.


Cool.

> I also changed
> the logic that removes buffers once we reach the limit.  Instead of
> getting rid of buffers that were not used on this run, I've changed it
> so it just gets rid of the buffers starting with the oldest one first,
> but stops once the number of buffers is at the maximum again.  This
> can mean that we end up with MAX_BUFFERED_TUPLES buffers instead of
> MAX_PARTITION_BUFFERS if there is only 1 tuple per buffer.  My current
> thinking is that this does not matter since only 1 slot will be
> allocated per buffer.  We'll remove all of the excess buffers during
> the flush and keep just MAX_PARTITION_BUFFERS of the newest buffers.

Yea, that makes sense to me.


> Also, after changing CopyMultiInsertBuffer to use fixed-size arrays
> instead of allocating them with another palloc, the performance has
> improved a bit more.
> 
> Using the perl files mentioned in [1]
> 
> Master + Patched:
> # copy listp from program $$perl ~/bench_same.pl$$ delimiter '|';
> COPY 35651564
> Time: 9106.776 ms (00:09.107)
> # truncate table listp;
> TRUNCATE TABLE
> # copy listp from program $$perl ~/bench.pl$$ delimiter '|';
> COPY 35651564
> Time: 10154.196 ms (00:10.154)
> 
> 
> Master only:
> # copy listp from program $$perl ~/bench_same.pl$$ delimiter '|';
> COPY 35651564
> Time: 22200.535 ms (00:22.201)
> # truncate table listp;
> TRUNCATE TABLE
> # copy listp from program $$perl ~/bench.pl$$ delimiter '|';
> COPY 35651564
> Time: 18592.107 ms (00:18.592)

Awesome. I'm probably too tired to give you meaningful feedback
tonight. But I'll do a quick readthrough.



> +/* Class the buffer full if there are >= this many bytes of tuples stored */
> +#define MAX_BUFFERED_BYTES   65535

Random aside: This seems pretty small (but should be changed separately).


> +/* Trim the list of buffers back down to this number after flushing */
> +#define MAX_PARTITION_BUFFERS32
> +
> +/* Stores multi-insert data related to a single relation in CopyFrom. */
> +typedef struct CopyMultiInsertBuffer
> +{
> + TupleTableSlot *slots[MAX_BUFFERED_TUPLES]; /* Array to store tuples */
> + ResultRelInfo *resultRelInfo;   /* ResultRelInfo for 'relid' */
> + BulkInsertState bistate;/* BulkInsertState for this rel */
> + int nused;  /* number of 'slots' containing tuples */
> + uint64  linenos[MAX_BUFFERED_TUPLES];   /* Line # of tuple in copy
> +  * stream */
> +} CopyMultiInsertBuffer;

I don't think it's needed now, but you'd probably achieve a bit better
locality by storing slots + linenos in a single array of (slot,lineno).

> +/*
> + * CopyMultiInsertBuffer_Init
> + *   Allocate memory and initialize a new CopyMultiInsertBuffer for 
> this
> + *   ResultRelInfo.
> + */
> +static CopyMultiInsertBuffer *
> +CopyMultiInsertBuffer_Init(ResultRelInfo *rri)
> +{
> + CopyMultiInsertBuffer *buffer;
> +
> + buffer = (CopyMultiInsertBuffer *) palloc(sizeof(CopyMultiInsertBuffer));
> + memset(buffer->slots, 0, sizeof(TupleTableSlot *) * MAX_BUFFERED_TUPLES);

Is there a reason to not just directly palloc0?
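
I.e. something like (sketch):

	buffer = (CopyMultiInsertBuffer *) palloc0(sizeof(CopyMultiInsertBuffer));
	/* slots[] is now all NULLs, so the explicit memset becomes unnecessary */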


> +/*
> + * CopyMultiInsertBuffer_Cleanup
> + *   Drop used slots and free member for this buffer.  The buffer
> + *   must be flushed before cleanup.
> + */
> +static inline void
> +CopyMultiInsertBuffer_Cleanup(CopyMultiInsertBuffer *buffer)
> +{
> + int i;
> +
> + /* Ensure buffer was flushed */
> + Assert(buffer->nused == 0);
> +
> + /* Remove back-link to ourself */
> + buffer->resultRelInfo->ri_CopyMultiInsertBuffer = NULL;
> +
> + ReleaseBulkInsertStatePin(buffer->bistate);

Hm, afaict this still leaks the bistate itself?
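
I'd expect something like this to plug it (sketch; FreeBulkInsertState()
also releases any buffer the bistate still has pinned before freeing it):

	FreeBulkInsertState(buffer->bistate);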


Greetings,

Andres Freund




Re: Ordered Partitioned Table Scans

2019-04-02 Thread David Rowley
On Wed, 3 Apr 2019 at 14:01, Amit Langote  wrote:
> I don't have any more comments.

Great. Many thanks for having a look at this.  Going by [1], Julien
also seems pretty happy, so I'm going to change this over to ready for
committer.

[1] 
https://www.postgresql.org/message-id/caobau_yptqbfqcp5jyjzetpl6mgyutwvtvvbzkzkc6xdbtd...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: COPY FROM WHEN condition

2019-04-02 Thread David Rowley
(Fixed all of what you mentioned)

On Wed, 3 Apr 2019 at 06:57, Andres Freund  wrote:
> > +/*
> > + * CopyMultiInsertInfo_Flush
> > + *   Write out all stored tuples in all buffers out to the tables.
> > + *
> > + * To save us from ending up with buffers for 1000s of partitions we remove
> > + * buffers belonging to partitions that we've seen no tuples for in this batch
>
> That seems a little naive (imagine you have like 5 partitions, and we
> constantly cycle through 2-3 of them per batch).  It's probably OK for
> this version.   I guess I'd only do that cleanup if we're actually
> flushing due to the number of partitions.

Hmm, good point.  It seems like being smarter there would be a good thing.

I've ended up getting rid of the hash table in favour of the List that
you mentioned and storing the buffer in ResultRelInfo. I also changed
the logic that removes buffers once we reach the limit.  Instead of
getting rid of buffers that were not used on this run, I've changed it
so it just gets rid of the buffers starting with the oldest one first,
but stops once the number of buffers is at the maximum again.  This
can mean that we end up with MAX_BUFFERED_TUPLES buffers instead of
MAX_PARTITION_BUFFERS if there is only 1 tuple per buffer.  My current
thinking is that this does not matter since only 1 slot will be
allocated per buffer.  We'll remove all of the excess buffers during
the flush and keep just MAX_PARTITION_BUFFERS of the newest buffers.
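
Roughly, the trimming ends up looking like this (sketch only; assumes the
buffers are kept in a List in creation order, and 'miinfo' is a made-up
name for the containing state):

	/*
	 * Run after everything has been flushed (every buffer has nused == 0),
	 * so cleaning up a buffer here only drops its slots and bistate.
	 */
	while (list_length(miinfo->multiInsertBuffers) > MAX_PARTITION_BUFFERS)
	{
		CopyMultiInsertBuffer *oldest =
			(CopyMultiInsertBuffer *) linitial(miinfo->multiInsertBuffers);

		CopyMultiInsertBuffer_Cleanup(oldest);
		miinfo->multiInsertBuffers = list_delete_first(miinfo->multiInsertBuffers);
	}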

Also, after changing CopyMultiInsertBuffer to use fixed-size arrays
instead of allocating them with another palloc, the performance has
improved a bit more.

Using the perl files mentioned in [1]

Master + Patched:
# copy listp from program $$perl ~/bench_same.pl$$ delimiter '|';
COPY 35651564
Time: 9106.776 ms (00:09.107)
# truncate table listp;
TRUNCATE TABLE
# copy listp from program $$perl ~/bench.pl$$ delimiter '|';
COPY 35651564
Time: 10154.196 ms (00:10.154)


Master only:
# copy listp from program $$perl ~/bench_same.pl$$ delimiter '|';
COPY 35651564
Time: 22200.535 ms (00:22.201)
# truncate table listp;
TRUNCATE TABLE
# copy listp from program $$perl ~/bench.pl$$ delimiter '|';
COPY 35651564
Time: 18592.107 ms (00:18.592)

> >   if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
> >   {
> >   ereport(ERROR,
> > -          (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > -          errmsg("cannot perform FREEZE on a partitioned table")));
> > +         (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> > +          errmsg("cannot perform FREEZE on a partitioned table")));
> >   }
>
> accidental hunk?

It was left over from a pgindent run. Now removed.

[1] 
https://www.postgresql.org/message-id/CAKJS1f98Fa+QRTGKwqbtz0M=cy1ehyr8q-w08cpa78toy4e...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


david_tableam_copy_v3.patch
Description: Binary data


Re: pg_basebackup ignores the existing data directory permissions

2019-04-02 Thread Haribabu Kommi
On Fri, Mar 29, 2019 at 9:05 PM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-03-26 03:26, Michael Paquier wrote:
> > Do we really want to extend the replication protocol to control that?
>
> Perhaps we are losing sight of the original problem, which is that if
> you create the target directory with the wrong permissions then ... it
> has the wrong permissions.  And you are free to change the permissions
> at any time.  Many of the proposed solutions sound excessively
> complicated relative to that.
>

Yes, I agree that the proposed solution for fixing the original problem of
existing data directory permissions with new group options to pg_basebackup
is a bit too complicated.

Why can't we just fix the permissions of the directory by default to match
the source instance? I feel this change would be useful to more people than
it would cause problems for.

From my understanding of the thread discussion,

+1 by:

Michael Paquier
Robert Haas
Haribabu Kommi

-1 by:

Magnus Hagander
Peter Eisentraut

Does anyone want to weigh in with an opinion on this patch, so we can
settle on the best option for controlling the existing standby data
directory permissions?

Regards,
Haribabu Kommi
Fujitsu Australia


Re: New vacuum option to do only freezing

2019-04-02 Thread Masahiko Sawada
On Wed, Apr 3, 2019 at 1:17 PM Kyotaro HORIGUCHI
 wrote:
>
> At Wed, 3 Apr 2019 12:10:03 +0900, Masahiko Sawada  
> wrote in 
> > > And in the following part:
> > >
> > > +   /* Set index cleanup option based on reloptions */
> > > +   if (params->index_cleanup == VACUUM_OPTION_DEFAULT)
> > > +   {
> > > +   if (onerel->rd_options == NULL ||
> > > +   ((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
> > > +   params->index_cleanup = VACUUM_OPTION_ENABLED;
> > > +   else
> > > +   params->index_cleanup = VACUUM_OPTION_DISABLED;
> > > +   }
> > > +
> > >
> > > The option should not be false while VACUUM FULL,
> >
> > I think that we need to complain only when the INDEX_CLEANUP option is
> > disabled by an explicit option on the VACUUM command and the FULL
> > option is specified. It's no problem when vacuum_index_cleanup is false
> > and the FULL option is true. Since internally we don't use index
> > cleanup during VACUUM FULL, I guess we don't need to require
> > index_cleanup to always be true even when the FULL option is specified.
>
> I know it's safe. It's just about integrity of option values. So
> I don't insist on that.
>
> > > and maybe we
> > > should complain in WARNING or NOTICE that the relopt is ignored.
> >
> > I think when users want to control index cleanup behavior manually
> > they specify INDEX_CLEANUP option on the VACUUM command. So it seems
> > to me that overwriting a reloption by an explicit option would be a
> > natural behavior. I'm concerned that these messages would rather
> > confuse users.
>
> If it "cannot be specified with FULL", it seems strange that it's
> safe being specified by reloptions.
>
> I'm rather thinking that INDEX_CLEANUP = false is ignorable even when
> specified with the FULL option, and DISABLE_PAGE_SKIPPING for
> VACUUM FULL should be ignored since VACUUM FULL doesn't skip pages
> in the first place.
>
> Couldn't we silence the DISABLE_PAGE_SKIPPING & FULL case instead
> of complaining about INDEX_CLEANUP & FULL? If so, I feel just
> ignoring the relopt cases is reasonable.

Agreed with being silent even when INDEX_CLEANUP/vacuum_index_cleanup
= false and FULL = true. DISABLE_PAGE_SKIPPING should be handled in a
separate patch, since it changes the existing behavior; it may need its
own discussion.

For the VacOptTernaryValue part, I've incorporated the review comments
but kept the new enum type, since it seems more straightforward for now.
I might change that if there is a better way.
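
For reference, with the ordering comment applied the enum would look
something like this (sketch):

typedef enum VacOptTernaryValue
{
	VACOPT_TERNARY_DEFAULT = 0,		/* no explicit setting; resolve later */
	VACOPT_TERNARY_DISABLED,
	VACOPT_TERNARY_ENABLED
} VacOptTernaryValue;

so that DEFAULT can't be confused with an explicit enabled/disabled value,
and the Assert suggested above can catch anything left unresolved.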

Attached are the updated patches, including the DISABLE_PAGE_SKIPPING
part (0003).

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From 48ad0ab24084833b888b9a0f5a7ce1a55fb24775 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada 
Date: Thu, 7 Mar 2019 09:45:11 +0900
Subject: [PATCH v13 1/3] Add INDEX_CLEANUP option to VACUUM command

If this option is false, VACUUM does HOT-pruning for live tuples but
doesn't remove dead tuples completely and disables index vacuum.

vacrelstats->dead_tuples could contain tuples that became dead after
being checked at HOT-pruning time, and which are not marked as dead.
Per discussion on pgsql-hackers, we normally record and remove them,
but with this option we don't process them and instead leave them for
the next vacuum, to simplify the code. That's okay because it's a very
rare condition and those tuples will be processed by the next vacuum.
---
 doc/src/sgml/ref/create_table.sgml | 15 +++
 doc/src/sgml/ref/vacuum.sgml   | 23 ++
 src/backend/access/common/reloptions.c | 13 +-
 src/backend/access/heap/vacuumlazy.c   | 80 ++
 src/backend/commands/vacuum.c  | 28 
 src/backend/postmaster/autovacuum.c|  1 +
 src/bin/psql/tab-complete.c|  6 ++-
 src/include/commands/vacuum.h  | 15 +++
 src/include/utils/rel.h|  1 +
 src/test/regress/expected/vacuum.out   |  9 
 src/test/regress/sql/vacuum.sql| 10 +
 11 files changed, 181 insertions(+), 20 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 0fcbc66..910db52 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1378,6 +1378,21 @@ WITH ( MODULUS numeric_literal, REM

 

+vacuum_index_cleanup (boolean)
+
+ 
+  Enables or disables index cleanup when VACUUM is
+  run on this table.  The default value is true.
+  Disabling index cleanup can speed up VACUUM very
+  significantly, but may also lead to severely bloated indexes if table
+  modifications are frequent.  The INDEX_CLEANUP
+  parameter to , if specified, overrides
+  the value of this option.
+ 
+
+   
+
+   
 autovacuum_vacuum_threshold, toast.autovacuum_vacuum_threshold (integer)
 
  
diff --git a/doc/src/sgml/ref/vacuum.sgml 

Re: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-04-02 Thread Haribabu Kommi
On Wed, Apr 3, 2019 at 1:59 PM Amit Kapila  wrote:

> On Thu, Mar 28, 2019 at 11:43 AM Haribabu Kommi
>  wrote:
> >
> > On Wed, Mar 27, 2019 at 11:27 PM Amit Kapila wrote:
> >>
> >> On Wed, Mar 27, 2019 at 6:53 AM Haribabu Kommi <kommi.harib...@gmail.com> wrote:
> >> > On Tue, Mar 26, 2019 at 9:08 PM Amit Kapila wrote:
> >> >>
> >> >>   As part of this thread, maybe we can
> >> >> just fix the case of the parallel cooperating transaction.
> >> >
> >> >
> >> > With the current patch, all the parallel implementation transaction
> are getting
> >> > skipped, in my tests parallel workers are the major factor in the
> transaction
> >> > stats counter. Even before parallelism, the stats of the autovacuum
> and etc
> >> > are still present but their contribution is not greatly influencing
> the stats.
> >> >
> >> > I agree with you in fixing the stats with parallel workers and
> improve stats.
> >> >
> >>
> >> I was proposing to fix only the transaction started with
> >> StartParallelWorkerTransaction by using is_parallel_worker flag as
> >> discussed above.  I understand that it won't completely fix the
> >> problem reported by you, but it will be a step in that direction.  My
> >> main worry is that if we fix it the way you are proposing and we also
> >> invent a new way to deal with all other internal transactions, then
> >> the fix done by us now needs to be changed/reverted.  Note, that this
> >> fix needs to be backpatched as well, so we should avoid doing any fix
> >> which needs to be changed or reverted.
> >
> >
> > I tried the approach you suggested: not counting the actual parallel
> > worker transactions, by just releasing the resources without touching
> > the counters; the counts are not reduced much.
> >
> > HEAD - With 4 parallel workers running query generates 13 stats (4 * 3 + 1)
> > Patch - With 4 parallel workers running query generates 9 stats (4 * 2 + 1)
> > Old approach patch - With 4 parallel workers running query generates 1 stat (1)
> >
> > Currently the parallel worker starts a transaction 3 times in the
> > following places:
> > 1. InitPostgres
> > 2. ParallelWorkerMain (2)
> >
> > With the attached patch, we reduce one count from ParallelWorkerMain.
> >
>
> Right, that is expected from this fix.  BTW, why have you changed the
> approach in this patch to not count anything from the parallel worker,
> compared to the earlier version where you were just ignoring the stats
> for transactions?  As of now, either way is fine, but in future (after
> parallel inserts/updates), we want other stats to be updated.  I think
> your previous idea was better; you just need to start using the
> is_parallel_worker flag.
>

Thanks for the review.

While changing the approach to use the is_parallel_worker flag, I thought
that the rest of the stats don't need to be updated either; they are all
write operations anyway, and those values are zero for parallel workers.

Instead of expanding the patch's scope, I just changed it to avoid counting
the commit or rollback stats as discussed; later we can target the handling
of all the internal transactions and their corresponding stats.
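
In other words, the end-of-transaction counting becomes something like
this (illustrative sketch, not the exact patch):

	/*
	 * Bump the commit/rollback counters only outside parallel workers,
	 * so each query's transaction is counted exactly once for the
	 * leader.  pgStatXactCommit/pgStatXactRollback are the existing
	 * counters in pgstat.c.
	 */
	if (!is_parallel_worker)
	{
		if (isCommit)
			pgStatXactCommit++;
		else
			pgStatXactRollback++;
	}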

Regards,
Haribabu Kommi
Fujitsu Australia


0001-Avoid-counting-parallel-worker-start-transactions_v4.patch
Description: Binary data


Re: Re: A separate table level option to control compression

2019-04-02 Thread Michael Paquier
On Tue, Apr 02, 2019 at 02:35:19PM +0900, Michael Paquier wrote:
> +   compress_tuple_threshold = RelationGetCompressTupleTarget(relation,
> +   toast_tuple_threshold);
> +   compress_tuple_threshold = Min(compress_tuple_threshold,
> +  toast_tuple_threshold);
> All the callers of RelationGetCompressTupleTarget directly compile the
> minimum between compress_tuple_threshold and toast_tuple_threshold,
> and also call RelationGetToastTupleTarget() beforehand.  Wouldn't it
> be better to merge all that in a common routine?  The same calculation
> method is duplicated 5 times.
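
Something like the following, for instance (sketch; the helper name is
made up):

static inline int
RelationGetEffectiveCompressTarget(Relation relation)
{
	int			toast_tuple_threshold =
		RelationGetToastTupleTarget(relation, TOAST_TUPLE_THRESHOLD);
	int			compress_tuple_threshold =
		RelationGetCompressTupleTarget(relation, toast_tuple_threshold);

	/* compression should never kick in later than toasting */
	return Min(compress_tuple_threshold, toast_tuple_threshold);
}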

I have been looking at this patch more, and here are some notes:
- The tests can be really simplified using directly reltoastrelid, so
I changed the queries this way.  I am aware that the surroundings
hardcode directly the relation name, but that's not really elegant in
my opinion.  And I am really tempted to adjust these as well to
directly use reltoastrelid.
- The docs had a weird indentation.
- I think that we should be careful with the portability of
pg_column_size(), so I have added comparisons instead of the direct
values in a way which does not change the meaning of the tests nor
their coverage.
- Having RelationGetCompressTupleTarget use directly
toast_tuple_threshold as default argument is I think kind of
confusing, so let's use a different static value, named
COMPRESS_TUPLE_TARGET in the attached.  This is similar to
TOAST_TUPLE_TARGET for the toast tuple threshold.
- The comments in tuptoaster.h need to be updated to outline the
difference between the compression invocation and the toast invocation
thresholds.  The wording could be better though.
- Better to avoid comments in the middle of the else/if blocks in my
opinion.

Also, the previous versions of the patch do that when doing a heap
insertion (heapam.c and rewriteheap.c):
+toast_tuple_threshold = RelationGetToastTupleTarget(relation,
+TOAST_TUPLE_THRESHOLD);
+compress_tuple_threshold = RelationGetCompressTupleTarget(relation,
+toast_tuple_threshold);
+compress_tuple_threshold = Min(compress_tuple_threshold,
+   toast_tuple_threshold);
[...]
need_toast = (HeapTupleHasExternal(&oldtup) ||
  HeapTupleHasExternal(newtup) ||
-newtup->t_len > TOAST_TUPLE_THRESHOLD);
+newtup->t_len > compress_tuple_threshold);

This means that the original code always uses the compilation-time,
default value of toast_tuple_target for all relations.  But this gets
changed so as we would use the value set at relation level for
toast_tuple_target if the reloption is changed, without touching
compress_tuple_threshold.  This is out of the scope of this patch, but
shouldn't we always use the relation-level value instead of the
compiled one?  Perhaps there is something I am missing?
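
Said differently, I would have expected the threshold check to look
something like this (sketch, leaving the compression part aside):

	toast_tuple_threshold = RelationGetToastTupleTarget(relation,
														TOAST_TUPLE_THRESHOLD);

	need_toast = (HeapTupleHasExternal(&oldtup) ||
				  HeapTupleHasExternal(newtup) ||
				  newtup->t_len > toast_tuple_threshold);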
--
Michael
diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 0fcbc660b3..acb858fba1 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1343,6 +1343,32 @@ WITH ( MODULUS numeric_literal, REM
 

 
+   
+compress_tuple_target (integer)
+
+ 
+  The compress_tuple_target parameter specifies the minimum tuple length
+  required before we try to compress columns marked as Extended or Main
+  and applies only to new tuples - there is no effect on existing rows.
+  By default this parameter is set to allow at least 4 tuples per block,
+  which with the default blocksize will be 2040 bytes. Valid values are
+  between 128 bytes and (blocksize - header), by default 8160 bytes.
+  If the specified value is greater than
+  toast_tuple_target, then we will use the current
+  setting of toast_tuple_target for
+  compress_tuple_target. Note that the default setting
+  is often close to optimal. If the value is set too low then the
+  TOAST may get invoked too often. If the
+  compressibility of the field values is not good, then compression and
+  decompression can add significant computation overhead without
+  corresponding savings in storage consumption.
+ 
+ 
+  This parameter cannot be set for TOAST tables.
+ 
+
+   
+

 parallel_workers (integer)
 
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 5df987f9c9..cfa0af571d 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -426,14 +426,17 @@ bytes regardless of the actual size of the represented value.
 
 The TOAST management code is triggered only
 when a row value to be stored in a table is wider than
-TOAST_TUPLE_THRESHOLD bytes (normally 2 kB).
-The TOAST code will compress and/or move
-field values out-of-line until the row value is shorter than
-TOAST_TUPLE_TARGET bytes (also normally 2 kB, adjustable)
-or no 

Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-04-02 Thread Pavan Deolasee
On Wed, Mar 27, 2019 at 9:47 AM Masahiko Sawada 
wrote:

>
> The patch looks good to me. There is no comment from me.
>
>
Thanks for your review! Updated patch attached since patch failed to apply
after recent changes in the master.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


copy_freeze_v5.patch
Description: Binary data


Re: Strange coding in _mdfd_openseg()

2019-04-02 Thread Kyotaro HORIGUCHI
At Wed, 3 Apr 2019 17:14:36 +1300, Thomas Munro  wrote 
in 
> Hello,
> 
> I think the following conditional code is misleading, and I wonder if
> it would be better like so:
> 
> --- a/src/backend/storage/smgr/md.c
> +++ b/src/backend/storage/smgr/md.c
> @@ -1787,8 +1787,13 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
> if (fd < 0)
> return NULL;
> 
> -   if (segno <= reln->md_num_open_segs[forknum])
> -   _fdvec_resize(reln, forknum, segno + 1);
> +   /*
> +    * Segments are always opened in order from lowest to highest, so we
> +    * must be adding a new one at the end.
> +    */
> +   Assert(segno == reln->md_num_open_segs[forknum]);
> +
> +   _fdvec_resize(reln, forknum, segno + 1);
> 
> /* fill the entry */
> v = &reln->md_seg_fds[forknum][segno];
> 
> I think the condition is always true, and with == it would also always
> be true.  If that weren't the case, the call to _fdvec_resize() code
> would effectively leak vfds.

I may be missing something, but it seems possible that
_mdfd_getseg calls it with segno > opensegs.

| for (nextsegno = reln->md_num_open_segs[forknum];
|  nextsegno <= targetseg; nextsegno++)
| ...
| v = _mdfd_openseg(reln, forknum, nextsegno, flags);


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: ToDo: show size of partitioned table

2019-04-02 Thread Kyotaro HORIGUCHI
Hello.

At Wed, 3 Apr 2019 12:55:06 +0900, Amit Langote  
wrote in 
> On 2019/04/03 12:02, Kyotaro HORIGUCHI wrote:
> > \dPn doesn't show children because they are of 'r' relkind. And
> > \dPn * doesn't show type for children.
...
> I think it's intentional that leaf partitions are not shown, because the
> patch is meant to allow showing only the "partitioned" members of
> partition trees.  Regular \d[t|i][+] shows normal tables ('r', 'i', etc.)
> including their size, but it's useless for partitioned tables ('P', 'I',
> etc.) as far as showing the size is concerned, hence this patch.  Even \dPn is
> meant to only show partitions that are themselves partitioned; note the
> "P" in the command.

+If the modifier n (nested) is used,
+then non-root partitioned tables are included, and a column is shown
+displaying the parent of each partitioned relation.

Ah. I see. "non-root *partitioned* tables". I misread the
phrase. Sorry for the noise.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: New vacuum option to do only freezing

2019-04-02 Thread Kyotaro HORIGUCHI
At Wed, 3 Apr 2019 12:10:03 +0900, Masahiko Sawada  
wrote in 
> > And in the following part:
> >
> > +   /* Set index cleanup option based on reloptions */
> > +   if (params->index_cleanup == VACUUM_OPTION_DEFAULT)
> > +   {
> > +   if (onerel->rd_options == NULL ||
> > +   ((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
> > +   params->index_cleanup = VACUUM_OPTION_ENABLED;
> > +   else
> > +   params->index_cleanup = VACUUM_OPTION_DISABLED;
> > +   }
> > +
> >
> > The option should not be false while VACUUM FULL,
> 
> I think that we need to complain only when the INDEX_CLEANUP option is
> disabled by an explicit option on the VACUUM command and the FULL
> option is specified. It's no problem when vacuum_index_cleanup is false
> and the FULL option is true. Since internally we don't use index
> cleanup during VACUUM FULL, I guess we don't need to require
> index_cleanup to always be true even when the FULL option is specified.

I know it's safe. It's just about integrity of option values. So
I don't insist on that.

> > and maybe we
> > should complain in WARNING or NOTICE that the relopt is ignored.
> 
> I think when users want to control index cleanup behavior manually
> they specify INDEX_CLEANUP option on the VACUUM command. So it seems
> to me that overwriting a reloption by an explicit option would be a
> natural behavior. I'm concerned that these messages would rather
> confuse users.

If it "cannot be specified with FULL", it seems strange that it's
safe being specified by reloptions.

I'm rather thinking that INDEX_CLEANUP = false is ignorable even when
specified with the FULL option, and DISABLE_PAGE_SKIPPING for
VACUUM FULL should be ignored since VACUUM FULL doesn't skip pages
in the first place.

Couldn't we silence the DISABLE_PAGE_SKIPPING & FULL case instead
of complaining about INDEX_CLEANUP & FULL? If so, I feel just
ignoring the relopt cases is reasonable.

Yeah, perhaps I'm worrying too much.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Strange coding in _mdfd_openseg()

2019-04-02 Thread Thomas Munro
Hello,

I think the following conditional code is misleading, and I wonder if
it would be better like so:

--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1787,8 +1787,13 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
if (fd < 0)
return NULL;

-   if (segno <= reln->md_num_open_segs[forknum])
-   _fdvec_resize(reln, forknum, segno + 1);
+   /*
+    * Segments are always opened in order from lowest to highest, so we
+    * must be adding a new one at the end.
+    */
+   Assert(segno == reln->md_num_open_segs[forknum]);
+
+   _fdvec_resize(reln, forknum, segno + 1);

/* fill the entry */
v = &reln->md_seg_fds[forknum][segno];

I think the condition is always true, and with == it would also always
be true.  If that weren't the case, the call to _fdvec_resize() code
would effectively leak vfds.

-- 
Thomas Munro
https://enterprisedb.com




Re: ToDo: show size of partitioned table

2019-04-02 Thread Amit Langote
Horiguchi-san,

Thanks for taking a look.

On 2019/04/03 12:02, Kyotaro HORIGUCHI wrote:
> \dPn doesn't show children because they are of 'r' relkind. And
> \dPn * doesn't show type for children.
> 
> create table pa (a int, b int) partition by range (a);
> create table c1 partition of pa for values from (0) to (10);
> create table c2 partition of pa for values from (10) to (20);
> create index on pa(a);
> 
> postgres=# \dPn *
>   List of partitioned tables and indexes
>  Schema |   Name   |  Owner   |       Type        | Parent name | On table 
> --------+----------+----------+-------------------+-------------+----------
>  public | pa       | horiguti | partitioned table |             | 
>  public | pa_a_idx | horiguti | partitioned index |             | pa
> (2 rows)
> 
> 
> With the attached on top of the latest patch, we get the following result.
> 
> postgres=# \dPn *
>   List of partitioned tables and indexes
>  Schema |   Name   |  Owner   |       Type        | Parent name | On table 
> --------+----------+----------+-------------------+-------------+----------
>  public | c1       | horiguti | partition table   | pa          | 
>  public | c1_a_idx | horiguti | partition index   | pa_a_idx    | c1
>  public | c2       | horiguti | partition table   | pa          | 
>  public | c2_a_idx | horiguti | partition index   | pa_a_idx    | c2
>  public | pa       | horiguti | partitioned table |             | 
>  public | pa_a_idx | horiguti | partitioned index |             | pa
> (6 rows)

I think it's intentional that leaf partitions are not shown, because the
patch is meant to allow showing only the "partitioned" members of
partition trees.  Regular \d[t|i][+] shows normal tables ('r', 'i', etc.)
including their size, but it's useless for partitioned tables ('P', 'I',
etc.) as far as showing the size is concerned, hence this patch.  Even \dPn is
meant to only show partitions that are themselves partitioned; note the
"P" in the command.

Thanks,
Amit





Re: New vacuum option to do only freezing

2019-04-02 Thread Masahiko Sawada
On Wed, Apr 3, 2019 at 10:56 AM Kyotaro HORIGUCHI
 wrote:
>
> Hello.
>
> At Mon, 1 Apr 2019 14:26:15 +0900, Masahiko Sawada  
> wrote in 
> > On Sat, Mar 30, 2019 at 5:04 AM Robert Haas  wrote:
> > >
> > > On Fri, Mar 29, 2019 at 12:27 PM Masahiko Sawada  
> > > wrote:
> > > > Yeah, but since multiple relations might be specified in VACUUM
> > > > command we need to process index_cleanup option after opened each
> > > > relations. Maybe we need to process all options except for
> > > > INDEX_CLEANUP in ExecVacuum() and pass VacuumStmt down to vacuum_rel()
> > > > and process it  in manner of you suggested after opened the relation.
> > > > Is that right?
> > >
> > > Blech, no, let's not do that.  We'd better use some method that can
> > > indicate yes/no/default.  Something like psql's trivalue thingy, but
> > > probably not exactly that.  We can define an enum for this purpose,
> > > for example - VACUUM_INDEX_CLEANUP_{ENABLED,DISABLED,DEFAULT}.  Or
> > > maybe there's some other way.  But let's not pass bits of the parse
> > > tree around any more than really required.
> >
> > I've defined an enum VacOptTernaryValue representing
> > enabled/disabled/default and added index_cleanup variable as the new
>

Thank you for reviewing the patch!

> It is defined as ENABLED=0, DISABLED=1, DEFAULT=2. At least
> ENABLED=0 and DISABLED=1 are misleading.

Indeed, will fix.

>
> > enum type to VacuumParams. The vacuum options that uses the reloption
> > value as the default value such as index cleanup and new truncation
> > option can use this enum and set either enabled or disabled after
> > opened the relation when it’s set to default. Please find the attached
> > patches.
>
> +static VacOptTernaryValue get_vacopt_ternary_value(DefElem *def);
>
> This is actually a boolean type converter. It is used to read the
> VACUUM option but not to read the reloption. It seems useless.
>
>
> Finally the ternary value is determined to be true or false before
> use. It would be simplest if index_cleanup could finally be read as a
> bool. We could add another boolean to indicate whether the value is
> set or not, but I think it would be better for the ternary type to be
> a straightforward expansion of bool: {DEFAULT = -1, DISABLED = 0,
> ENABLED = 1}, making sure that index_cleanup is not DEFAULT past a
> certain point.
>
> So, how about this?
>
> #define VACOPT_TERNARY_DEFAULT -1
> typedef int VacOptTernaryValue;  /* -1 is undecided, 0 is false, 1 is true */

Hmm, if we do that, we set either VACOPT_TERNARY_DEFAULT, true, or false
to index_cleanup, but I'm not sure this is a good approach. I think we
would want VACOPT_TERNARY_TRUE and VACOPT_TERNARY_FALSE, as we defined the
new type as a ternary value and already have VACOPT_TERNARY_DEFAULT.

>
> /* By this point the value must not be left DEFAULT */
> Assert (params->index_cleanup != VACOPT_TERNARY_DEFAULT);

Agreed, will add it.

>
>
> > > > > I think it would be better to just ignore the INDEX_CLEANUP option
> > > > > when FULL is specified.
> > > >
> > > > Okay, but why do we ignore that in this case while we complain in the
> > > > case of FULL and DISABLE_PAGE_SKIPPING?
> > >
> > > Well, that's a fair point, I guess.  If we go that that route, we'll
> > > need to make sure that setting the reloption doesn't prevent VACUUM
> > > FULL from working -- the complaint must only affect an explicit option
> > > on the VACUUM command line.  I think we should have a regression test
> > > for that.
> >
> > I've added regression tests. Since we check it before setting
> > index_cleanup based on reloptions, we don't prevent VACUUM FULL from
> > working even when vacuum_index_cleanup is false.
>
> +  errmsg("VACUUM option INDEX_CLEANUP cannot be set to false with FULL")));
>
> I'm not one to talk on this, but this seems somewhat confused.
>
> "VACUUM option INDEX_CLEANUP cannot be set to false with FULL being specified"
>
> or
>
> "INDEX_CLEANUP cannot be disabled for VACUUM FULL"?

I prefer the former, will fix.

>
>
> And in the following part:
>
> +   /* Set index cleanup option based on reloptions */
> +   if (params->index_cleanup == VACUUM_OPTION_DEFAULT)
> +   {
> +   if (onerel->rd_options == NULL ||
> +   ((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
> +   params->index_cleanup = VACUUM_OPTION_ENABLED;
> +   else
> +   params->index_cleanup = VACUUM_OPTION_DISABLED;
> +   }
> +
>
> The option should not be false while VACUUM FULL,

I think that we need to complain only when the INDEX_CLEANUP option is
disabled by an explicit option on the VACUUM command and the FULL
option is specified. It's no problem when vacuum_index_cleanup is false
and the FULL option is true. Since internally we don't use index
cleanup during VACUUM FULL, I guess we don't need to require
index_cleanup to always be true even when the FULL option is specified.

> and maybe we
> should complain in WARNING or NOTICE that the relopt is 

Re: ToDo: show size of partitioned table

2019-04-02 Thread Kyotaro HORIGUCHI
At Tue, 2 Apr 2019 22:49:25 -0300, Alvaro Herrera  
wrote in <20190403014925.GA9976@alvherre.pgsql>
> Hi Amit
> 
> On 2019-Apr-03, Amit Langote wrote:
> 
> > Thanks.  I'm getting an error while applying v12:
> > 
> > patching file src/bin/psql/po/es.po
> > Hunk #1 FAILED at 12.
> > 1 out of 2 hunks FAILED -- saving rejects to file src/bin/psql/po/es.po.rej
> 
> Uh, I certainly did not intend to send the es.po part of the patch, I
> was just testing that the translation worked as intended.  That's what I
> get for doing stuff till the last minute ...

\dPn doesn't show children because they are of 'r' relkind. And
\dPn * doesn't show type for children.

create table pa (a int, b int) partition by range (a);
create table c1 partition of pa for values from (0) to (10);
create table c2 partition of pa for values from (10) to (20);
create index on pa(a);

postgres=# \dPn *
  List of partitioned tables and indexes
 Schema |   Name   |  Owner   |       Type        | Parent name | On table 
--------+----------+----------+-------------------+-------------+----------
 public | pa       | horiguti | partitioned table |             | 
 public | pa_a_idx | horiguti | partitioned index |             | pa
(2 rows)


With the attached on top of the latest patch, we get the following result.

postgres=# \dPn *
  List of partitioned tables and indexes
 Schema |   Name   |  Owner   |       Type        | Parent name | On table 
--------+----------+----------+-------------------+-------------+----------
 public | c1       | horiguti | partition table   | pa          | 
 public | c1_a_idx | horiguti | partition index   | pa_a_idx    | c1
 public | c2       | horiguti | partition table   | pa          | 
 public | c2_a_idx | horiguti | partition index   | pa_a_idx    | c2
 public | pa       | horiguti | partitioned table |             | 
 public | pa_a_idx | horiguti | partitioned index |             | pa
(6 rows)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 736cca33d6..5f97bfc4b2 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3954,7 +3954,11 @@ listPartitions(const char *pattern, bool verbose, bool show_indexes,
 		}
 	}
 
-	appendPQExpBuffer(, "\nWHERE c.relkind IN (%s)", relkind_str);
+	if (show_nested)
+		appendPQExpBuffer(, "\nWHERE (c.relkind IN (%s) OR c3.relkind IN (%s))", relkind_str, relkind_str);
+	else
+		appendPQExpBuffer(, "\nWHERE c.relkind IN (%s)", relkind_str);
+		
 	appendPQExpBufferStr(, !show_nested ? " AND NOT c.relispartition\n" : "\n");
 
 	if (!pattern)
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 5f97bfc4b2..41075f9a67 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3864,9 +3864,13 @@ listPartitions(const char *pattern, bool verbose, bool show_indexes,
 	  ",\n  CASE c.relkind"
 	  " WHEN " CppAsString2(RELKIND_PARTITIONED_TABLE) " THEN '%s'"
 	  " WHEN " CppAsString2(RELKIND_PARTITIONED_INDEX) " THEN '%s'"
+	  " WHEN " CppAsString2(RELKIND_RELATION) " THEN '%s'"
+	  " WHEN " CppAsString2(RELKIND_INDEX) " THEN '%s'"
 	  " END as \"%s\"",
 	  gettext_noop("partitioned table"),
 	  gettext_noop("partitioned index"),
+	  gettext_noop("partition table"),
+	  gettext_noop("partition index"),
 	  gettext_noop("Type"));
 
 		translate_columns[3] = true;


Re: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-04-02 Thread Amit Kapila
On Thu, Mar 28, 2019 at 11:43 AM Haribabu Kommi
 wrote:
>
> On Wed, Mar 27, 2019 at 11:27 PM Amit Kapila  wrote:
>>
>> On Wed, Mar 27, 2019 at 6:53 AM Haribabu Kommi  
>> wrote:
>> > On Tue, Mar 26, 2019 at 9:08 PM Amit Kapila  
>> > wrote:
>> >>
>> >>   As part of this thread, maybe we can
>> >> just fix the case of the parallel cooperating transaction.
>> >
>> >
>> > With the current patch, all the parallel implementation transaction are 
>> > getting
>> > skipped, in my tests parallel workers are the major factor in the 
>> > transaction
>> > stats counter. Even before parallelism, the stats of the autovacuum and etc
>> > are still present but their contribution is not greatly influencing the 
>> > stats.
>> >
>> > I agree with you in fixing the stats with parallel workers and improve 
>> > stats.
>> >
>>
>> I was proposing to fix only the transaction started with
>> StartParallelWorkerTransaction by using is_parallel_worker flag as
>> discussed above.  I understand that it won't completely fix the
>> problem reported by you, but it will be a step in that direction.  My
>> main worry is that if we fix it the way you are proposing and we also
>> invent a new way to deal with all other internal transactions, then
>> the fix done by us now needs to be changed/reverted.  Note, that this
>> fix needs to be backpatched as well, so we should avoid doing any fix
>> which needs to be changed or reverted.
>
>
> I tried the approach you suggested: not counting the actual parallel
> worker transactions, by just releasing the resources without touching
> the counters; the counts are not reduced much.
>
> HEAD - With 4 parallel workers running query generates 13 stats ( 4 * 3  + 1)
> Patch -  With 4 parallel workers running query generates 9 stats ( 4 * 2 + 1)
> Old approach patch - With 4 parallel workers running query generates 1 stat (1)
>
> Currently the parallel worker starts a transaction 3 times in the
> following places:
> 1. InitPostgres
> 2. ParallelWorkerMain (2)
>
> With the attached patch, we reduce one count from ParallelWorkerMain.
>

Right, that is expected from this fix.  BTW, why have you changed the
approach in this patch to not count anything from the parallel worker,
compared to the earlier version where you were just ignoring the stats
for transactions?  As of now, either way is fine, but in future (after
parallel inserts/updates), we want other stats to be updated.  I think
your previous idea was better; you just need to start using the
is_parallel_worker flag.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: New vacuum option to do only freezing

2019-04-02 Thread Kyotaro HORIGUCHI
Hello.

At Mon, 1 Apr 2019 14:26:15 +0900, Masahiko Sawada  
wrote in 
> On Sat, Mar 30, 2019 at 5:04 AM Robert Haas  wrote:
> >
> > On Fri, Mar 29, 2019 at 12:27 PM Masahiko Sawada  
> > wrote:
> > > Yeah, but since multiple relations might be specified in VACUUM
> > > command we need to process index_cleanup option after opened each
> > > relations. Maybe we need to process all options except for
> > > INDEX_CLEANUP in ExecVacuum() and pass VacuumStmt down to vacuum_rel()
> > > and process it  in manner of you suggested after opened the relation.
> > > Is that right?
> >
> > Blech, no, let's not do that.  We'd better use some method that can
> > indicate yes/no/default.  Something like psql's trivalue thingy, but
> > probably not exactly that.  We can define an enum for this purpose,
> > for example - VACUUM_INDEX_CLEANUP_{ENABLED,DISABLED,DEFAULT}.  Or
> > maybe there's some other way.  But let's not pass bits of the parse
> > tree around any more than really required.
> 
> I've defined an enum VacOptTernaryValue representing
> enabled/disabled/default and added index_cleanup variable as the new

It is defined as ENABLED=0, DISABLED=1, DEFAULT=2. At least
ENABLED=0 and DISABLED=1 are misleading. 

> enum type to VacuumParams. The vacuum options that uses the reloption
> value as the default value such as index cleanup and new truncation
> option can use this enum and set either enabled or disabled after
> opened the relation when it’s set to default. Please find the attached
> patches.

+static VacOptTernaryValue get_vacopt_ternary_value(DefElem *def);

This is actually a boolean type converter. It is used to read the
VACUUM option but not to read the reloption. It seems useless.


Finally the ternary value is determined to be true or false before
use. It would be simplest if index_cleanup could finally be read as a
bool. We could add another boolean to indicate whether the value is
set or not, but I think it would be better for the ternary type to be
a straightforward expansion of bool: {DEFAULT = -1, DISABLED = 0,
ENABLED = 1}, making sure that index_cleanup is not DEFAULT past a
certain point.

So, how about this?

#define VACOPT_TERNARY_DEFAULT -1
typedef int VacOptTernaryValue;  /* -1 is undecided, 0 is false, 1 is true */

/* By this point the value must not be left DEFAULT */
Assert (params->index_cleanup != VACOPT_TERNARY_DEFAULT);


> > > > I think it would be better to just ignore the INDEX_CLEANUP option
> > > > when FULL is specified.
> > >
> > > Okay, but why do we ignore that in this case while we complain in the
> > > case of FULL and DISABLE_PAGE_SKIPPING?
> >
> > Well, that's a fair point, I guess.  If we go that that route, we'll
> > need to make sure that setting the reloption doesn't prevent VACUUM
> > FULL from working -- the complaint must only affect an explicit option
> > on the VACUUM command line.  I think we should have a regression test
> > for that.
> 
> I've added regression tests. Since we check it before setting
> index_cleanup based on reloptions, we don't prevent VACUUM FULL from
> working even when vacuum_index_cleanup is false.

+  errmsg("VACUUM option INDEX_CLEANUP cannot be set to false with FULL")));

I'm not one to talk on this, but this seems somewhat confused.

"VACUUM option INDEX_CLEANUP cannot be set to false with FULL being specified"

or

"INDEX_CLEANUP cannot be disabled for VACUUM FULL"?


And in the following part:

+   /* Set index cleanup option based on reloptions */
+   if (params->index_cleanup == VACUUM_OPTION_DEFAULT)
+   {
+   if (onerel->rd_options == NULL ||
+   ((StdRdOptions *) onerel->rd_options)->vacuum_index_cleanup)
+   params->index_cleanup = VACUUM_OPTION_ENABLED;
+   else
+   params->index_cleanup = VACUUM_OPTION_DISABLED;
+   }
+

The option should not be false while VACUUM FULL, and maybe we
should complain in WARNING or NOTICE that the relopt is ignored.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: Problem with default partition pruning

2019-04-02 Thread Amit Langote
Hosoya-san,

On 2019/04/02 14:02, Yuzuko Hosoya wrote:
> Hi,
> 
>> Maybe we should have two patches as we seem to be improving two things:
>>
>> 1. Patch to fix problems with default partition pruning originally reported 
>> by Hosoya-san
>>
>> 2. Patch to determine if a given clause contradicts a sub-partitioned 
>> table's partition constraint,
>> fixing problems unearthed by Thibaut's tests
> 
> I attached the latest patches according to Amit's comments.
> v3_default_partition_pruning.patch fixes default partition pruning problems
> and ignore_contradictory_where_clauses_at_partprune_step.patch fixes
> sub-partition problems Thibaut tested.

Thanks for dividing patches that way.

Would it be a good idea to add some new test cases to these patches, just
so it's easily apparent what we're changing?

So, we could add the test case presented by Thibaut at the following link
to the default_partition_pruning.patch:

https://www.postgresql.org/message-id/a4968068-6401-7a9c-8bd4-6a3bc9164a86%40dalibo.com

And, another reported at the following link to
ignore_contradictory_where_clauses_at_partprune_step.patch:

https://www.postgresql.org/message-id/bd03f475-30d4-c4d0-3d7f-d2fbde755971%40dalibo.com

Actually, it might be possible/better to construct the test queries in
partition_prune.sql using the existing tables in that script, that is,
without defining new tables just for adding the new test cases.  If not,
maybe it's OK to create the new tables too.

Thanks,
Amit





Re: ToDo: show size of partitioned table

2019-04-02 Thread Alvaro Herrera
Hi Amit

On 2019-Apr-03, Amit Langote wrote:

> Thanks.  I'm getting an error while applying v12:
> 
> patching file src/bin/psql/po/es.po
> Hunk #1 FAILED at 12.
> 1 out of 2 hunks FAILED -- saving rejects to file src/bin/psql/po/es.po.rej

Uh, I certainly did not intend to send the es.po part of the patch, I
was just testing that the translation worked as intended.  That's what I
get for doing stuff till the last minute ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Progress reporting for pg_verify_checksums

2019-04-02 Thread Michael Paquier
On Tue, Apr 02, 2019 at 10:10:34AM +0200, Fabien COELHO wrote:
> For online, we should want throttling so that the check can have a reduced
> performance impact when scrubbing.

Yes.  Throttling is a necessary property in my opinion, perhaps in
combination with some autovacuum-like options to only trigger the
checks for a relation after a certain amount of activity has happened
on it.
--
Michael


signature.asc
Description: PGP signature


Re: Progress reporting for pg_verify_checksums

2019-04-02 Thread Michael Paquier
On Tue, Apr 02, 2019 at 04:02:59PM +0200, Michael Banck wrote:
> Can you explain in more detail how this would work? I thought we came to
> the conclusion (and the documentation seems to indicate so), that you
> should stop all participating instances of a cluster and then enable
> checksums on all of them, which would impose a downtime.

That's what I was pointing out here:
https://www.postgresql.org/message-id/20190320225924.gc20...@paquier.xyz

I still agree about keeping in the docs safer recommendations, in the
shape of the ones currently present, than what I mentioned on the
other thread.
--
Michael


signature.asc
Description: PGP signature


Re: Checksum errors in pg_stat_database

2019-04-02 Thread Michael Paquier
On Tue, Apr 02, 2019 at 07:06:35PM +0200, Magnus Hagander wrote:
> I think having the count and the last time make sense, but I'm very
> sceptical about the rest.

There may be other things which we are not considering on this
thread.  I don't know.

> I can somewhat agree that splitting it on a per-database level might even
> at that be overdoing it. What might actually be more interesting from a
> failure-location perspective would be tablespace, rather than any of the
> others. Or we could reduce it down to just putting it in pg_stat_bgwriter
> and only count global values perhaps, if in the end we don't think the
> split-per-database is reasonable?

A split per database or per tablespace is I think a very good thing.
This helps in tracking down which partitions have gone crazy, and a
single global counter does not allow that.
--
Michael


signature.asc
Description: PGP signature


Re: ToDo: show size of partitioned table

2019-04-02 Thread Amit Langote
Hi Alvaro,

On 2019/04/03 4:42, Alvaro Herrera wrote:
> On 2019-Apr-02, Alvaro Herrera wrote:
> 
>> Something is not right:
> 
> Another thing that was not right is that translated_columns was being
> marked static, and modified in the function; so if you called \dP twice
> where one called for translation and the second not, the second time
> we'd also translating that column.
> 
> Attached contains fixes for those three issues.

Thanks.  I'm getting an error while applying v12:

patching file src/bin/psql/po/es.po
Hunk #1 FAILED at 12.
1 out of 2 hunks FAILED -- saving rejects to file src/bin/psql/po/es.po.rej

Did you accidentally add it to the patch?

Thanks,
Amit





Re: Ordered Partitioned Table Scans

2019-04-02 Thread Amit Langote
On 2019/04/03 3:10, David Rowley wrote:
> On Wed, 3 Apr 2019 at 01:26, Amit Langote  wrote:
>> +#include "nodes/pathnodes.h"
>> ...
>> +partitions_are_ordered(struct RelOptInfo *partrel)
>>
>> Maybe, "struct" is unnecessary?
> 
> I just left it there so that the signature matched the header file.
> Looking around for examples I see make_partition_pruneinfo() has the
> structs only in the header file, so I guess that is how we do things,
> so changed to that in the attached.

Ah, I see.  Thanks for updating the patch.

I don't have any more comments.

Thanks,
Amit





Re: [PATCH v20] GSSAPI encryption support

2019-04-02 Thread Joe Conway
On 4/2/19 6:18 PM, Stephen Frost wrote:
> Greetings,
> 
> On Tue, Apr 2, 2019 at 18:10 Peter Eisentraut wrote:
> 
> On 2019-02-23 17:27, Stephen Frost wrote:
> >> About pg_hba.conf: The "hostgss" keyword seems a bit confusing.  It only
> >> applies to encrypted gss-using connections, not all of them.  Maybe
> >> "hostgssenc" or "hostgsswrap"?
> > Not quite sure what you mean here, but 'hostgss' seems to be quite well
> > in-line with what we do for SSL...  as in, we have 'hostssl', we don't
> > say 'hostsslenc'.  I feel like I'm just not understanding what you mean
> > by "not all of them".
> 
> Reading the latest patch, I think this is still a bit confusing.
> Consider an entry like
> 
>     hostgss all             all             0.0.0.0/0                gss
> 
> The "hostgss" part means, the connection is GSS-*encrypted*.  The "gss"
> entry in the last column means use gss for *authentication*.  But didn't
> "hostgss" already imply that?  No.  I understand what's going on, but it
> seems quite confusing.  They both just say "gss"; you have to know a lot
> about the nuances of pg_hba.conf processing to get that.
> 
> If you have a line like
> 
>     hostgss all             all             0.0.0.0/0                md5
> 
> it is not obvious that this means, if GSS-encrypted, use md5.  It could
> just as well mean, if GSS-authenticated, use md5.
> 
> The analogy with SSL is such that we use "hostssl" for connections using
> SSL encryption and "cert" for the authentication method.  So there we
> use two different words for two different aspects of SSL.
> 
> 
> I don’t view it as confusing, but I’ll change it to hostgssenc as was
> suggested earlier to address that concern.  It’s a bit wordy but if it
> helps reduce confusion then that’s a good thing.

Personally I don't find it as confusing as is either, and I find hostgss
to be a good analog of hostssl. On the other hand hostgssenc is long and
unintuitive. So +1 for leaving as is and -1 for changing it IMHO.

Joe
-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development




Re: [PATCH v20] GSSAPI encryption support

2019-04-02 Thread Stephen Frost
Greetings,

On Tue, Apr 2, 2019 at 18:10 Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 2019-02-23 17:27, Stephen Frost wrote:
> >> About pg_hba.conf: The "hostgss" keyword seems a bit confusing.  It only
> >> applies to encrypted gss-using connections, not all of them.  Maybe
> >> "hostgssenc" or "hostgsswrap"?
> > Not quite sure what you mean here, but 'hostgss' seems to be quite well
> > in-line with what we do for SSL...  as in, we have 'hostssl', we don't
> > say 'hostsslenc'.  I feel like I'm just not understanding what you mean
> > by "not all of them".
>
> Reading the latest patch, I think this is still a bit confusing.
> Consider an entry like
>
> hostgss all all 0.0.0.0/0   gss
>
> The "hostgss" part means, the connection is GSS-*encrypted*.  The "gss"
> entry in the last column means use gss for *authentication*.  But didn't
> "hostgss" already imply that?  No.  I understand what's going on, but it
> seems quite confusing.  They both just say "gss"; you have to know a lot
> about the nuances of pg_hba.conf processing to get that.
>
> If you have a line like
>
> hostgss all all 0.0.0.0/0   md5
>
> it is not obvious that this means, if GSS-encrypted, use md5.  It could
> just as well mean, if GSS-authenticated, use md5.
>
> The analogy with SSL is such that we use "hostssl" for connections using
> SSL encryption and "cert" for the authentication method.  So there we
> use two different words for two different aspects of SSL.


I don’t view it as confusing, but I’ll change it to hostgssenc as was
suggested earlier to address that concern.  It’s a bit wordy but if it
helps reduce confusion then that’s a good thing.
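
So an entry such as:

    hostgssenc all             all             0.0.0.0/0               gss

should make it clearer that the first keyword is about encryption and the
last column about authentication.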

Thanks,

Stephen



Re: [PATCH v20] GSSAPI encryption support

2019-04-02 Thread Peter Eisentraut
On 2019-02-23 17:27, Stephen Frost wrote:
>> About pg_hba.conf: The "hostgss" keyword seems a bit confusing.  It only
>> applies to encrypted gss-using connections, not all of them.  Maybe
>> "hostgssenc" or "hostgsswrap"?
> Not quite sure what you mean here, but 'hostgss' seems to be quite well
> in-line with what we do for SSL...  as in, we have 'hostssl', we don't
> say 'hostsslenc'.  I feel like I'm just not understanding what you mean
> by "not all of them".

Reading the latest patch, I think this is still a bit confusing.
Consider an entry like

hostgss all all 0.0.0.0/0   gss

The "hostgss" part means, the connection is GSS-*encrypted*.  The "gss"
entry in the last column means use gss for *authentication*.  But didn't
"hostgss" already imply that?  No.  I understand what's going on, but it
seems quite confusing.  They both just say "gss"; you have to know a lot
about the nuances of pg_hba.conf processing to get that.

If you have a line like

hostgss all all 0.0.0.0/0   md5

it is not obvious that this means, if GSS-encrypted, use md5.  It could
just as well mean, if GSS-authenticated, use md5.

The analogy with SSL is such that we use "hostssl" for connections using
SSL encryption and "cert" for the authentication method.  So there we
use two different words for two different aspects of SSL.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Pluggable Storage - Andres's take

2019-04-02 Thread Andres Freund
Hi,

On 2019-04-02 17:11:07 +1100, Haribabu Kommi wrote:
> From a72cfcd523887f1220473231d7982928acc23684 Mon Sep 17 00:00:00 2001
> From: Hari Babu 
> Date: Tue, 2 Apr 2019 15:41:17 +1100
> Subject: [PATCH 1/2] tableam : doc update of table access methods
> 
> Providing basic explanation of table access methods
> including their structure details and reference heap
> implementation files.
> ---
>  doc/src/sgml/catalogs.sgml |  5 ++--
>  doc/src/sgml/filelist.sgml |  1 +
>  doc/src/sgml/postgres.sgml |  1 +
>  doc/src/sgml/tableam.sgml  | 56 ++
>  4 files changed, 61 insertions(+), 2 deletions(-)
>  create mode 100644 doc/src/sgml/tableam.sgml
> 
> diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
> index f4aabf5dc7..200708e121 100644
> --- a/doc/src/sgml/catalogs.sgml
> +++ b/doc/src/sgml/catalogs.sgml
> @@ -587,8 +587,9 @@
> The catalog pg_am stores information about
> relation access methods.  There is one row for each access method 
> supported
> by the system.
> -   Currently, only indexes have access methods.  The requirements for index
> -   access methods are discussed in detail in .
> +   Currently, only table and indexes have access methods. The requirements 
> for table
> +   access methods are discussed in detail in  and 
> the
> +   requirements for index access methods are discussed in detail in  linkend="indexam"/>.
>

I also adapted pg_am.amtype.
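
(So both AM kinds are visible there now; for instance:

    SELECT amname, amtype FROM pg_am;
    -- amtype 't' = table AM (e.g. heap), 'i' = index AM
)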


> diff --git a/doc/src/sgml/tableam.sgml b/doc/src/sgml/tableam.sgml
> new file mode 100644
> index 00..9eca52ee70
> --- /dev/null
> +++ b/doc/src/sgml/tableam.sgml
> @@ -0,0 +1,56 @@
> +
> +
> +
> + Table Access Method Interface Definition
> + 
> +  
> +   This chapter defines the interface between the core 
> PostgreSQL
> +   system and access methods, which manage 
> TABLE
> +   types. The core system knows nothing about these access methods beyond
> +   what is specified here, so it is possible to develop entirely new access
> +   method types by writing add-on code.
> +  
> +  
> +  
> +   All Tables in PostgreSQL system are the primary
> +   data store. Each table is stored as its own physical 
> relation
> +   and so is described by an entry in the pg_class
> +   catalog. A table's content is entirely controlled by its access method.
> +   (All the access methods furthermore use the standard page layout described
> +   in .)
> +  

I don't think there's actually any sort of dependency on the page
layout. It's entirely conceivable to write an AM that doesn't use
postgres' shared buffers.

> +  
> +   A table access method handler function must be declared to accept a single
> +   argument of type internal and to return the pseudo-type
> +   table_am_handler.  The argument is a dummy value that simply
> +   serves to prevent handler functions from being called directly from SQL 
> commands.

> +   The result of the function must be a palloc'd struct of type 
> TableAmRoutine,
> +   which contains everything that the core code needs to know to make use of
> +   the table access method.

That's not been correct for a while...
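
These days the handler returns a pointer to a static const struct, the
way heapam does -- roughly (the my_* names here are placeholders):

    static const TableAmRoutine my_methods = {
        .type = T_TableAmRoutine,
        /* ... callback pointers ... */
    };

    Datum
    my_tableam_handler(PG_FUNCTION_ARGS)
    {
        PG_RETURN_POINTER(&my_methods);
    }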


> The TableAmRoutine struct,
> +   also called the access method's API struct, 
> includes
> +   fields specifying assorted fixed properties of the access method, such as
> +   whether it can support bitmap scans.  More importantly, it contains 
> pointers
> +   to support functions for the access method, which do all of the real work 
> to
> +   access tables. These support functions are plain C functions and are not
> +   visible or callable at the SQL level.  The support functions are described
> +   in TableAmRoutine structure. For more details, 
> please
> +   refer the file src/include/access/tableam.h.
> +  

This seems to not have been adapted after copying it from indexam?


I'm still working on this (in particular I think storage.sgml and
probably some other places needs updates to make clear they apply to
heap not generally; I think there needs to be some references to generic
WAL records in tableam.sgml, ...), but I got to run a few errands.

One thing I want to call out is that I made the reference to
src/include/access/tableam.h a link to gitweb. I think that makes it
much more useful to the casual reader. But it also means that, barring
some infrastructure / procedure we don't have, the link will just
continue to point to master. I'm not particularly concerned about that,
but it seems worth pointing out, given that we've only a single link to
gitweb in the sgml docs so far.
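
(Presumably something along the lines of:

    <ulink url="https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/tableam.h;hb=HEAD">
     <filename>src/include/access/tableam.h</filename>
    </ulink>
)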

Greetings,

Andres Freund
>From 33826b211e3f2d5fe76024e2938ff2eb6aeb00da Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 2 Apr 2019 14:55:30 -0700
Subject: [PATCH] tableam: docs

---
 doc/src/sgml/catalogs.sgml|  9 +-
 doc/src/sgml/config.sgml  | 17 
 doc/src/sgml/filelist.sgml|  1 +
 doc/src/sgml/postgres.sgml|  1 +
 doc/src/sgml/ref/create_access_method.sgml| 

Re: amcheck verification for GiST

2019-04-02 Thread Peter Geoghegan
On Thu, Mar 28, 2019 at 11:30 PM Andrey Borodin  wrote:
> Here's updated patch with AccessShareLock.
> I've tried to stress this with a combination of random inserts, vacuums and
> checks. This process neither failed, nor deadlocked.
> The patch intentionally contains one superfluous line to make gist logically
> broken. This triggers the regression test of amcheck.

Any thoughts on this, Heikki?

It would be nice to be able to squeeze this into Postgres 12,
especially given that GiST has been enhanced for 12 already.

-- 
Peter Geoghegan




Re: PostgreSQL pollutes the file system

2019-04-02 Thread Fred .Flintstone
It looks like this thread is featured on LWN under the article:
Program names and "pollution".
https://lwn.net/
https://lwn.net/Articles/784508/ (Subscription required)

On Sat, Mar 30, 2019 at 12:27 PM Peter Eisentraut
 wrote:
>
> On 2019-03-29 20:32, Joe Conway wrote:
> >   pg_util  
>
> How is that better than just renaming to pg_$oldname?
>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Unified logging system for command-line programs

2019-04-02 Thread Alvaro Herrera
On 2019-Apr-02, Tom Lane wrote:

> Alvaro Herrera  writes:
> > I don't much like the code that does
> >pg_log_error("%s", something);
> 
> > because then the string "%s" is marked for translation.
> 
> Uh, surely we've got hundreds of instances of that in the system already?

Actually, we don't have that many, but there are more than I remembered
there being -- my memory was telling me that I had eradicated them all
in commit 55a70a023c3d but that's sadly misinformed.  Seeing this (and
also because the API would become nastier than I thought it would), I'll
leave this stuff be for now.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Unified logging system for command-line programs

2019-04-02 Thread Tom Lane
Alvaro Herrera  writes:
> I don't much like the code that does
>pg_log_error("%s", something);

> because then the string "%s" is marked for translation.

Uh, surely we've got hundreds of instances of that in the system already?

> Maybe we should
> consider a variant that takes a straight string literal instead of a
> sprintf-style fmt to avoid this problem.  We'd do something like
>pg_log_error_v(something);
> which does not call _() within.

What it looks like that's doing is something similar to appendPQExpBuffer
versus appendPQExpBufferStr, ie, just skipping the overhead of sprintf
format processing when you don't need it.  The implications for
translatability or not are unobvious, so I'm afraid this would result
in confusion and missed translations.

I'm not necessarily against some idea like this, but how do we
separate "translatability" from "sprintf formatting"?
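
For illustration (pg_log_error_str is hypothetical, not an existing API):

    /* needs both translation and sprintf processing: */
    pg_log_error("could not open file \"%s\": %m", fname);

    /* pre-built message: today this needs a "%s" shim, which puts the
     * literal "%s" into the message catalog: */
    pg_log_error("%s", PQerrorMessage(conn));

    /* a pg_log_error_str(msg) could skip the sprintf processing, but
     * whether msg was translated at its source is invisible at the
     * call site -- hence the confusion risk. */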

regards, tom lane




Re: Unified logging system for command-line programs

2019-04-02 Thread Alvaro Herrera
I don't much like the code that does

   pg_log_error("%s", something);

because then the string "%s" is marked for translation.  Maybe we should
consider a variant that takes a straight string literal instead of a
sprintf-style fmt to avoid this problem.  We'd do something like

   pg_log_error_v(something);

which does not call _() within.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: ToDo: show size of partitioned table

2019-04-02 Thread Alvaro Herrera
On 2019-Apr-02, Alvaro Herrera wrote:

> Something is not right:

Another thing that was not right is that translated_columns was being
marked static, and modified in the function; so if you called \dP twice,
where one call asked for translation and the second did not, the second
time we'd also be translating that column.

The attached contains fixes for those three issues.

I'm gonna go over the docs now.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index b29e7547c6a..a789042ce5f 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1659,6 +1659,72 @@ testdb=
 
   
 
+
+  
+\dP[n+] [ pattern ]
+
+
+Lists partitioned relations.  If pattern is specified, only entries
+whose name matches the pattern are listed.  By default, only
+partitioned tables are listed; supply a pattern to also include
+partitioned indexes.  If the form \dP+
+is used, the sum of sizes of table's partitions (including their
+indexes) is also displayed, along with the associated description.
+
+
+
+If the modifier n (nested) is used,
+then non-root partitioned tables are included, and a column is shown
+displaying the parent of each partitioned relation.
+
+If n is combined with +, two
+sizes are shown: one including the total size of directly-attached
+leaf partitions, and another showing the total size of all partitions,
+including indirectly attached sub-partitions.
+
+
+  
+
+
+  
+\dPi[n+] [ pattern ]
+
+
+Lists partitioned indexes. If pattern is specified, only entries
+whose name matches the pattern are listed. If the form
+\dPi+ is used, the sum of sizes of index's
+partitions is also displayed, along with the associated description.
+
+
+
+ If the n modifier is used, non-root partitioned
+ indexes are displayed too.
+
+
+  
+
+
+  
+\dPt[n+] [ pattern ]
+
+
+Lists partitioned tables. If pattern is specified, only entries
+whose name matches the pattern are listed. If the form
+\dPt+ is used, the sum of sizes of table's
+partitions is also displayed, along with the associated description.
+
+
+
+ If the n modifier is used, non-root partitioned
+ tables are displayed too.
+
+
+  
+
+
   
 \drds [ role-pattern [ database-pattern ] ]
 
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 72188b7f3ef..ce8bee99087 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -782,6 +782,42 @@ exec_command_d(PsqlScanState scan_state, bool active_branch, const char *cmd)
 			case 'p':
 success = permissionsList(pattern);
 break;
+			case 'P':
+{
+	bool show_nested = strchr(cmd, 'n') ? true : false;
+
+	switch (cmd[2])
+	{
+		case 'i':
+			/* show indexes only */
+			success = listPartitions(pattern, show_verbose,
+	 true, false,
+	 show_nested);
+			break;
+		case 't':
+			/* show tables only */
+			success = listPartitions(pattern, show_verbose,
+	 false, true,
+	 show_nested);
+			break;
+		case '+':
+		case '\0':
+		case 'n':
+			/*
+			 * Show only tables if there is no pattern.  Also
+			 * show indexes if pattern is specified.  Show
+			 * total size if verbose output is specified.
+			 */
+			success = listPartitions(pattern, show_verbose,
+	 false, false,
+	 show_nested);
+			break;
+		default:
+			status = PSQL_CMD_UNKNOWN;
+			break;
+	}
+}
+break;
 			case 'T':
 success = describeTypes(pattern, show_verbose, show_system);
 break;
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 036810303a6..736cca33d6b 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3774,6 +3774,247 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 	return true;
 }
 
+/*
+ * listPartitions()
+ *
+ * handler for \dP[t|i][n]
+ */
+bool
+listPartitions(const char *pattern, bool verbose, bool show_indexes,
+			   bool show_tables, bool show_nested)
+{
+	PQExpBufferData buf;
+	PGresult   *res;
+	printQueryOpt myopt = pset.popt;
+	bool translate_columns[] = {false, false, false, false, false, false, false, false, false};
+	const char *size_function;
+	const char *relkind_str;
+	const char *errmsg_pattern;
+	const char *errmsg_no_pattern;
+	const char *tabletitle;
+	bool		mixed_output = false;
+
+	/*
+	 * Note: Declarative table partitions are only supported as of Pg 10.0.
+	 */

Re: Pluggable Storage - Andres's take

2019-04-02 Thread Andres Freund
Hi,

On 2019-04-02 17:11:07 +1100, Haribabu Kommi wrote:

> +  xreflabel="default_table_access_method">
> +  default_table_access_method 
> (string)
> +  
> +   default_table_access_method configuration 
> parameter
> +  
> +  
> +  
> +   
> +The value is either the name of a table access method, or an empty 
> string
> +to specify using the default table access method of the current 
> database.
> +If the value does not match the name of any existing table access 
> method,
> +PostgreSQL will automatically use the 
> default
> +table access method of the current database.
> +   

Hm, this doesn't strike me as right (there's no such thing as "default
table access method of the current database"). You just get an error in
that case. I think we should simply not allow setting to "" - what's the
point in that?
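
(I.e., presumably the sane behavior is simply:

    SET default_table_access_method = 'heap';  -- OK
    SET default_table_access_method = '';      -- rejected with an error
)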

Greetings,

Andres Freund




Re: ToDo: show size of partitioned table

2019-04-02 Thread Alvaro Herrera
Something is not right:

alvherre=# \dP t
 List of partitioned tables or indexes
 Esquema | Nombre |  Dueño   |   Tipo|   On 
table   
-++--+---+--
 public  | t  | alvherre | partitioned table | Project-Id-Version: psql 
(PostgreSQL) 10+
 ||  |   | Report-Msgid-Bugs-To: 
pgsql-b...@postgresql.org +
 ||  |   | PO-Revision-Date: 2017-07-10 
12:14-0400 +
 ||  |   | Last-Translator: Álvaro 
Herrera+
 ||  |   | Language-Team: PgSQL Español 
+
 ||  |   | Language: es 
   +
 ||  |   | MIME-Version: 1.0
   +
 ||  |   | Content-Type: text/plain; 
charset=UTF-8 +
 ||  |   | Content-Transfer-Encoding: 
8bit +
 ||  |   | Plural-Forms: nplurals=2; 
plural=n != 1;+
 ||  |   | 
(1 fila)


-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: ToDo: show size of partitioned table

2019-04-02 Thread Alvaro Herrera
On 2019-Mar-13, Amit Langote wrote:

> + if (show_indexes)
> + {
> + objects_name = gettext_noop("indexes");
...
> + if (pattern)
> + /* translator: objects_name is "index", "table" */
> + psql_error("Did not find any partitioned %s named 
> \"%s\".\n",
> +object_name,
> +pattern);
...
> + /* translator: objects_name is "indexes", "tables" */
> + appendPQExpBuffer(, _("List of partitioned %s"), 
> objects_name);
> +

This stuff doesn't work from a translation standpoint; sentence building
is not allowed.  You have to do stuff like:

if (tables)
{
errmsg = gettext_noop("Could not find any partitioned table");
tabletitle = gettext_noop("List of partitioned tables");
}
 ...
if (pattern)
/* translator: objects_name is "index", "table" */
psql_error(gettext(errmsg), object_name, pattern);
  ...
appendPQExpBuffer(, gettext(tabletitle));

and so on.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] generated columns

2019-04-02 Thread Erik Rijkers

On 2019-04-02 15:36, Erik Rijkers wrote:

On 2019-04-02 14:43, Peter Eisentraut wrote:

On 2019-04-01 10:52, Peter Eisentraut wrote:

On 2019-03-31 05:49, Erik Rijkers wrote:

STORED in a
file_fdw foreign table still silently creates the column which then
turns out to be useless on SELECT, with an error like:

"ERROR:  column some_column_name is a generated column
DETAIL:  Generated columns cannot be used in COPY."

Maybe it'd be possible to get an error earlier, i.e., while trying 
to

create such a useless column?


I'll look into it.


I've been trying to create a test case for file_fdw for this, but I'm
not getting your result.  Can you send a complete test case?


Attached is run_ft.sh, which creates a text file (/tmp/pg_head.txt),
then sets it up as a foreign table, and adds a generated column.

It then runs a successful select, followed by an error-producing select.

Some selects are successful but some fail.  I'm not sure why it sometimes
fails (it's not just the explicitness of the generated-column-name like
I suggested earlier).




My output of run_ft.sh is below.


$ ./run_ft.sh
create schema if not exists "tmp";
CREATE SCHEMA
create server if not exists "tmpserver" foreign data wrapper file_fdw;
CREATE SERVER
drop   foreign table if exists tmp.pg_head cascade;
DROP FOREIGN TABLE
create foreign table   tmp.pg_head (
"Gene"  text,
"Ratio H/L normalized Exp1" numeric
)
server tmpserver
options (
delimiter E'\t'
  , format 'csv'
  , header 'TRUE'
  , filename  '/tmp/pg_head.txt'
);
CREATE FOREIGN TABLE
alter foreign table tmp.pg_head
   add column "Ratio H/L normalized Exp1 Log2 (Generated column)" 
numeric generated always as (case when "Ratio H/L normalized Exp1" > 0 
then log(2, "Ratio H/L normalized Exp1") else  null end) stored

;
ALTER FOREIGN TABLE
-- this is OK (although the generated-column values are all empty/null)
select
 "Gene"
   , "Ratio H/L normalized Exp1"
   , "Ratio H/L normalized Exp1 Log2 (Generated column)"
from tmp.pg_head
limit 3 ;
  Gene  | Ratio H/L normalized Exp1 | Ratio H/L normalized Exp1 Log2 
(Generated column)

+---+---
 Dhx9   |   NaN |
 Gapdh  |   0.42288 |
 Gm8797 |   0.81352 |
(3 rows)

-- but this fails
select
"Gene"
   , "Ratio H/L normalized Exp1 Log2 (Generated column)"
from tmp.pg_head
limit 3 ;
ERROR:  column "Ratio H/L normalized Exp1 Log2 (Generated column)" is a 
generated column

DETAIL:  Generated columns cannot be used in COPY.


#!/bin/bash

echo "Gene	Ratio H/L normalized Exp1
Dhx9	NaN
Gapdh	0.42288
Gm8797	0.81352
Aldh2	0.89913
Ccdc12	NaN
Hip1	NaN
Hist1h2aa	0.66911
Tpm2	0.57535
Fasn	NaN
Aldoa	0.61898
Unc13b	NaN
Wrn	0.0050816
Psma1	NaN
Ldha	0.90211
Numa1	NaN" > /tmp/pg_head.txt

psql -Xa << FT_SETUP_TXT

create schema if not exists "tmp";
create server if not exists "tmpserver" foreign data wrapper file_fdw;
drop   foreign table if exists tmp.pg_head cascade;
create foreign table   tmp.pg_head (
"Gene"  text,
"Ratio H/L normalized Exp1" numeric
)
server tmpserver
options (
delimiter E'\t'
  , format 'csv'
  , header 'TRUE'
  , filename  '/tmp/pg_head.txt'
);

alter foreign table tmp.pg_head
   add column "Ratio H/L normalized Exp1 Log2 (Generated column)" numeric generated always as (case when "Ratio H/L normalized Exp1" > 0 then log(2, "Ratio H/L normalized Exp1") else  null end) stored
;

FT_SETUP_TXT

psql -qXa << SQL_TXT

-- this is OK (although the generated-column values are all empty/null)
select 
 "Gene" 
   , "Ratio H/L normalized Exp1"
   , "Ratio H/L normalized Exp1 Log2 (Generated column)"
from tmp.pg_head
limit 3 ;

-- but this fails
select 
"Gene"
   , "Ratio H/L normalized Exp1 Log2 (Generated column)"
from tmp.pg_head
limit 3 ;

SQL_TXT





Re: Ordered Partitioned Table Scans

2019-04-02 Thread David Rowley
On Wed, 3 Apr 2019 at 01:26, Amit Langote  wrote:
> +#include "nodes/pathnodes.h"
> ...
> +partitions_are_ordered(struct RelOptInfo *partrel)
>
> Maybe, "struct" is unnecessary?

I just left it there so that the signature matched the header file.
Looking around for examples, I see make_partition_pruneinfo() has the
structs only in the header file, so I guess that is how we do things;
I've changed to that in the attached.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


mergeappend_to_append_conversion_v19.patch
Description: Binary data


Re: COPY FROM WHEN condition

2019-04-02 Thread Andres Freund
Gah, ignore this mail - I've somehow accidentally sent this early. I'm
not quite sure, but I miskeyed, and emacs sent it early, without going
through mutt...




Re: [HACKERS] proposal: schema variables

2019-04-02 Thread Pavel Stehule
út 26. 3. 2019 v 6:40 odesílatel Pavel Stehule 
napsal:

> Hi
>
> ne 24. 3. 2019 v 6:57 odesílatel Pavel Stehule 
> napsal:
>
>> Hi
>>
>> rebase against current master
>>
>
>
> fixed issue IF NOT EXISTS & related regress tests
>

another rebase

Regards

Pavel


> Regards
>
> Pavel
>
>
>> Regards
>>
>> Pavel
>>
>


schema-variables-20190402.patch.gz
Description: application/gzip


Re: COPY FROM WHEN condition

2019-04-02 Thread Andres Freund
On 2019-04-03 06:41:49 +1300, David Rowley wrote:
> However, I've ended up not doing it that way as the patch requires
> more than just an array of TupleTableSlots to be stored in the
> ResultRelInfo, it'll pretty much need all of what I have in
> CopyMultiInsertBuffer, which includes line numbers for error
> reporting, a BulkInsertState and a counter to track how many of the
> slots are used.  I had thoughts that we could just tag the whole
> CopyMultiInsertBuffer in ResultRelInfo, but that requires moving it
> somewhere a bit more generic. Another thought would be that we have
> something like "void *extra;" in ResultRelInfo that we can just borrow
> for copy.c.  It may be interesting to try this out to see if it saves
> much in the way of performance.

Hm, we could just forward declare struct CopyMultiInsertBuffer in a
header, and only define it in copy.c. That doesn't sound insane to me.
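
I.e., roughly (a sketch; the field name just follows this discussion):

    /* in a header, e.g. execnodes.h */
    struct CopyMultiInsertBuffer;	/* defined privately in copy.c */

    typedef struct ResultRelInfo
    {
        ...
        /* for use by copy.c when performing multi-inserts */
        struct CopyMultiInsertBuffer *ri_CopyMultiInsertBuffer;
    } ResultRelInfo;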


> I've got the patch into a working state now and wrote a bunch of
> comments. I've not really done a thorough read back of the code again.
> I normally wait until the light is coming from a different angle
> before doing that, but there does not seem to be much time to wait for
> that in this case, so here's v2.  Performance seems to be about the
> same as what I reported yesterday.

Cool.


> +/*
> + * Multi-Insert handling code for COPY FROM.  Inserts are performed in
> + * batches of up to MAX_BUFFERED_TUPLES.

This should probably be moved to the top of the file.


> + * When COPY FROM is used on a partitioned table, we attempt to maintain
> + * only MAX_PARTITION_BUFFERS buffers at once.  Buffers that are completely
> + * unused in each batch are removed so that we end up not keeping buffers for
> + * partitions that we might not insert into again.
> + */
> +#define MAX_BUFFERED_TUPLES  1000
> +#define MAX_BUFFERED_BYTES   65535
> +#define MAX_PARTITION_BUFFERS15

> +/*
> + * CopyMultiInsertBuffer
> + *   Stores multi-insert data related to a single relation in 
> CopyFrom.
> + */
> +typedef struct

Please don't create anonymous structs - they can't be forward declared,
and some debugging tools handle them worse than named structs. And it
makes naming CopyMultiInsertBuffer in the comment less necessary.


> +{
> + Oid relid;  /* Relation ID this 
> insert data is for */
> + TupleTableSlot **slots; /* Array of MAX_BUFFERED_TUPLES to store
> +  * tuples */
> + uint64 *linenos;/* Line # of tuple in copy stream */

It could make sense to allocate those two together, or even as part of
the CopyMultiInsertBuffer itself, to reduce allocator overhead.


> + else
> + {
> + CopyMultiInsertBuffer *buffer = (CopyMultiInsertBuffer *) 
> palloc(sizeof(CopyMultiInsertBuffer));
> +
> + /* Non-partitioned table.  Just setup the buffer for the table. 
> */
> + buffer->relid = RelationGetRelid(rri->ri_RelationDesc);
> + buffer->slots = palloc0(MAX_BUFFERED_TUPLES * 
> sizeof(TupleTableSlot *));
> + buffer->linenos = palloc(MAX_BUFFERED_TUPLES * sizeof(uint64));
> + buffer->resultRelInfo = rri;
> + buffer->bistate = GetBulkInsertState();
> + buffer->nused = 0;
> + miinfo->multiInsertBufferTab = NULL;
> + miinfo->buffer = buffer;
> + miinfo->nbuffers = 1;
> + }

Can this be moved into a helper function?


> + /*
> +  * heap_multi_insert leaks memory, so switch to short-lived memory 
> context
> +  * before calling it.
> +  */

s/heap_multi_insert/table_multi_insert/



> + /*
> +  * If there are any indexes, update them for all the inserted tuples, 
> and
> +  * run AFTER ROW INSERT triggers.
> +  */
> + if (resultRelInfo->ri_NumIndices > 0)
> + {
> + for (i = 0; i < nBufferedTuples; i++)
> + {
> + List   *recheckIndexes;
> +
> + cstate->cur_lineno = buffer->linenos[i];
> + recheckIndexes =
> + ExecInsertIndexTuples(buffer->slots[i], estate, 
> false, NULL,
> +   NIL);
> + ExecARInsertTriggers(estate, resultRelInfo,
> +  
> buffer->slots[i],
> +  
> recheckIndexes, cstate->transition_capture);
> + list_free(recheckIndexes);
> + }
> + }
> +
> + /*
> +  * There's no indexes, but see if we need to run AFTER ROW INSERT 
> triggers
> +  * anyway.
> +  */
> + else if (resultRelInfo->ri_TrigDesc != NULL &&
> +  (resultRelInfo->ri_TrigDesc->trig_insert_after_row ||
> +   


Re: COPY FROM WHEN condition

2019-04-02 Thread David Rowley
On Tue, 2 Apr 2019 at 14:11, Andres Freund  wrote:
> Why do we need that list membership check? If you append the
> ResultRelInfo to the list when creating the ResultRelInfo's slots array,
> you don't need to touch the list after a partition lookup - you know
> it's a member if the ResultRelInfo has a slot array.  Then you only need
> to iterate the list when you want to drop slots to avoid using too much
> memory - and that's also a sequential scan of the hash table in your
> approach, right?

Okay, you're right, we could just determine if we've got a new
ResultRelInfo by the lack of any bulk insert slots being set in it.
However, I've ended up not doing it that way as the patch requires
more than just an array of TupleTableSlots to be stored in the
ResultRelInfo, it'll pretty much need all of what I have in
CopyMultiInsertBuffer, which includes line numbers for error
reporting, a BulkInsertState and a counter to track how many of the
slots are used.  I had thoughts that we could just tag the whole
CopyMultiInsertBuffer in ResultRelInfo, but that requires moving it
somewhere a bit more generic. Another thought would be that we have
something like "void *extra;" in ResultRelInfo that we can just borrow
for copy.c.  It may be interesting to try this out to see if it saves
much in the way of performance.

I've got the patch into a working state now and wrote a bunch of
comments. I've not really done a thorough read back of the code again.
I normally wait until the light is coming from a different angle
before doing that, but there does not seem to be much time to wait for
that in this case, so here's v2.  Performance seems to be about the
same as what I reported yesterday.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


david_tableam_copy_v2.patch
Description: Binary data


Re: Checksum errors in pg_stat_database

2019-04-02 Thread Magnus Hagander
On Tue, Apr 2, 2019 at 8:47 AM Michael Paquier  wrote:

> On Tue, Apr 02, 2019 at 07:43:12AM +0200, Julien Rouhaud wrote:
> > On Tue, Apr 2, 2019 at 6:56 AM Michael Paquier 
> wrote:
> >>  One thing which is not
> >> proposed on this patch, and I am fine with it as a first draft, is
> >> that we don't have any information about the broken block number and
> >> the file involved.  My gut tells me that we'd want a separate view,
> >> like pg_stat_checksums_details with one tuple per (dboid, rel, fork,
> >> blck) to be complete.  But that's just for future work.
> >
> > That could indeed be nice.
>
> Actually, backpedaling on this one...  pg_stat_checksums_details may
> be a bad idea as we could finish with one row per broken block.  If
> a corruption is spreading quickly, pgstat would not be able to sustain
> that amount of objects.  Having pg_stat_checksums would allow us to
> plugin more data easily based on the last failure state:
> - last relid of failure
> - last fork type of failure
> - last block number of failure.
> Not saying to do that now, but having that in pg_stat_database does
> not seem very natural to me.  And on top of that we would have an
> extra row full of NULLs for shared objects in pg_stat_database if we
> adopt the unique view approach...  I find that rather ugly.
>

I think that tracking each and every block is of course a non-starter, as
you've noticed.

I'm really not sure how much those three extra fields help, TBH. As I see
it, the real use case for this is automated monitoring and quick checks of
the kind "is my db currently broken somewhere", in combination with "did
this occur recently" (for people who have never looked at their stats).

This gives people enough information to know where to go look in the logs.
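
(E.g., a monitoring probe of roughly this shape, with the columns as
discussed in this thread:

    SELECT datname, checksum_failures, checksum_last_failure
      FROM pg_stat_database
     WHERE checksum_failures > 0;
)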

I mean, what's the actual use case for tracking relid/fork/block of the
*last* failure only? To monitor and see if it changes? What do I do when I
have 10 failures, and I only know about the last one? (I have to go to the
logs anyway.)

I think having the count and the last time makes sense, but I'm very
sceptical about the rest.

I can somewhat agree that splitting it on a per-database level might, even
at that, be overdoing it. What might actually be more interesting from a
failure-location perspective would be the tablespace, rather than any of the
others. Or perhaps we could reduce it down to just putting it in
pg_stat_bgwriter and only counting global values, if in the end we don't
think the split-per-database is reasonable?

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Inadequate executor locking of indexes

2019-04-02 Thread Tom Lane
David Rowley  writes:
> On Thu, 14 Mar 2019 at 21:52, Amit Langote
>  wrote:
>> If the correct lock is taken in both cases by the current code, then maybe
>> there's no need to change anything?

> I'm aiming to fix the problem described by Tom when he started this
> thread. It's pretty easy to get an assert failure in master.

Yeah, removing the Assert failure is a minimum requirement.  Whether
we need to do more than that can be debated.

> I think this borderline impossible to fix completely.

After further review I concur with that position.  The cases where
we have lock upgrade hazards are where some subquery uses a table
in a way that requires a stronger lock than is needed for some other
reference that was processed earlier.  It's kind of pointless to
guarantee that we avoid that in the planner or executor if the parser
already hit the problem; and it seems darn near impossible, and
certainly impractical, to avoid the problem while parsing.

(Actually, even if we fixed the parser, I'm not sure we'd be out of
the woods.  Consider a case where a subquery requires AccessShareLock
on some table, but then when we come to expand inheritance in the
planner, we discover that that same table is a child or partition of
some target table, so now we need a stronger lock on it.)

I think therefore that we should forget about the idea of avoiding
lock upgrade hazards, at least for situations where the hazard is
caused by conflicting table references that are all written by the
user.  That's not common, and since none of these lock types are
exclusive, the odds of an actual deadlock are low anyway.

> Maybe you're right about being able to use rellockmode for indexes,
> but I assume that we lowered the lock level for indexes for some
> reason, and this would reverse that.

I kind of think that we *should* use rellockmode for indexes too.
To the extent that we're doing differently from that now, I think
it's accidental not intentional.  It would perhaps have been difficult
to clean that up completely before we added rellockmode, but now that
we've got that we should use it.  I feel that adding a second field
for index lock mode would just be perpetuating some accidental
inconsistencies.

In short, I think we should take the parts of this patch that modify
the index_open calls, but make them use rte->rellockmode; and forget
all the other parts.
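
That is, roughly:

    /* instead of a hardwired lock level at the index_open call sites ... */
    indexRel = index_open(indexOid, RowExclusiveLock);

    /* ... take the mode already recorded for the owning table: */
    indexRel = index_open(indexOid, rte->rellockmode);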

BTW, I'd also suggest expanding the header comment for 
ExecRelationIsTargetRelation to explain that it's no longer
used in the core code, but we keep it around because FDWs
might want it.

regards, tom lane




Re: Compressed TOAST Slicing

2019-04-02 Thread Stephen Frost
Greetings,

* Darafei "Komяpa" Praliaskouski (m...@komzpa.net) wrote:
> > I'll plan to push this tomorrow with the above change (and a few
> > additional comments to explain what all is going on..).
> 
> Is everything ok? Can it be pushed?

This has been pushed now.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: [PATCH v22] GSSAPI encryption support

2019-04-02 Thread Stephen Frost
Greetings,

* Robbie Harwood (rharw...@redhat.com) wrote:
> Stephen Frost  writes:
> > * Robbie Harwood (rharw...@redhat.com) wrote:
> >> Stephen Frost  writes:
> >> 
> >> I wanted to note a couple things about this approach.  It now uses
> >> one more buffer than before (in contrast to the previous approach,
> >> which reused a buffer for received data that was encrypted and
> >> decrypted).
> >
> > Yeah, I don't see that as too much of an issue and it certainly seems
> > cleaner and simpler to reason about to me, which seems worth the
> > modest additional buffer cost.  That's certainly something we could
> > change if others feel differently.
> >
> >> Since these are static fixed size buffers, this increases the total
> >> steady-state memory usage by 16k as opposed to re-using the buffer.
> >> This may be fine; I don't know how tight RAM is here.
> >
> > It seems unlikely to be an issue to me- and I would contend that the
> > prior implementation didn't actually take any steps to prevent the
> > other side from sending packets of nearly arbitrary size (up to 1G),
> > so while the steady-state memory usage of the prior implementation was
> > less when everyone was playing nicely, it could have certainly been
> > abused.  I'm a lot happier having an explicit cap on how much memory
> > will be used, even if that cap is a bit higher.
> 
> Yeah, that's definitely a fair point.  We could combine the two
> approaches, but I don't really see a reason to if it's unlikely to be an
> issue - as you say, this is more readable.  It can always be a follow-on
> if needed.

Agreed, we could do that later if it seems like it would be helpful.

> > I did add in a simple pg_stat_gssapi view, modeled on pg_stat_ssl, so
> > that you can check server-side if GSSAPI was used for authentication
> > and/or encryption, and what principal was used if GSSAPI was used for
> > authentication.
> 
> Good idea.

Thanks. :)  I definitely like being able to see from the backend side of
things when a connection is encrypted or not, and being able to see the
principal used for authentication is really nice too.
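
(Modeled on pg_stat_ssl, so presumably usable along the lines of:

    SELECT pid, gss_authenticated, principal, encrypted
      FROM pg_stat_gssapi
     WHERE pid = pg_backend_pid();
)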

> >> Again, these are nits, and I think I'm okay with the changes.
> >
> > Great, thanks again for reviewing!
> >
> > Updated patch attached, if you could take another look through it,
> > that'd be great.
> 
> I'm happy with this!  Appreciate you exploring my concerns.

Fantastic, thanks so much for working on this over the years!  Unless
there's any further comments, I'm going to push this tomorrow.

Thanks again!

Stephen


signature.asc
Description: PGP signature


Re: Minimal logical decoding on standbys

2019-04-02 Thread Andres Freund
Hi,

On 2019-04-02 15:26:52 +0530, Amit Khandekar wrote:
> On Thu, 14 Mar 2019 at 15:00, Amit Khandekar  wrote:
> > I managed to get a recovery conflict by:
> > 1. Setting hot_standby_feedback to off
> > 2. Creating a logical replication slot on standby
> > 3. Creating a table on master, and inserting some data
> > 4. Running: VACUUM FULL
> >
> > This gives WARNING messages in the standby log file.
> > 2019-03-14 14:57:56.833 IST [40076] WARNING:  slot decoding_standby w/
> > catalog xmin 474 conflicts with removed xid 477
> > 2019-03-14 14:57:56.833 IST [40076] CONTEXT:  WAL redo at 0/3069E98
> > for Heap2/CLEAN: remxid 477
> >
> > But I did not add such a testcase into the test file, because with the
> > current patch, it does not do anything with the slot; it just keeps on
> > emitting WARNING in the log file; so we can't test this scenario as of
> > now using the tap test.
> 
> I am going ahead with drop-the-slot way of handling the recovery
> conflict. I am trying out using ReplicationSlotDropPtr() to drop the
> slot. It seems the required locks are already in place inside the for
> loop of ResolveRecoveryConflictWithSlots(), so we can directly call
> ReplicationSlotDropPtr() when the slot xmin conflict is found.

Cool.


> As explained above, the only way I could reproduce the conflict is by
> turning hot_standby_feedback off on slave, creating and inserting into
> a table on master and then running VACUUM FULL. But after doing this,
> I am not able to verify whether the slot is dropped, because on slave,
> any simple psql command thereon waits on a lock acquired on a syscache
> relation, e.g. pg_authid. Working on it.

I think that indicates a bug somewhere. If replay progressed, it should
have killed the slot, and continued replaying past the VACUUM
FULL. Those symptoms suggest replay is stuck somewhere. I suggest a)
compiling with WAL_DEBUG enabled, and turning on wal_debug=1, b) looking
at a backtrace of the startup process.
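
Concretely, something like (a sketch):

    # (a) on the standby, with a build that has WAL_DEBUG defined
    #     (see pg_config_manual.h), in postgresql.conf:
    wal_debug = on

    # (b) attach to the startup process and grab a backtrace:
    gdb -batch -ex 'bt' -p <pid of startup process>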

Greetings,

Andres Freund




Re: Column lookup in a row performance

2019-04-02 Thread Tom Lane
Павлухин Иван  writes:
>> (1) Backwards compatibility, and (2) it's not clear that a different
>> layout would be a win for all cases.

> I am curious about (2); for my understanding, it would be good to find
> at least one case where a layout with lengths/offsets in a header
> would be clearly worse. I will be happy if someone can elaborate.

It seems like you think the only figure of merit here is how fast
deform_heap_tuple runs.  That's not the case.  There are at least
two issues:

1.  You're not going to be able to do this without making tuples
larger overall in many cases; but more data means more I/O which
means less performance.  I base this objection on the observation
that our existing design allows single-byte length "words" in many
common cases, but it's really hard to see how you could avoid
storing a full-size offset for each column if you want to be able
to access each column in O(1) time without any examination of other
columns.

2.  Our existing system design has an across-the-board assumption
that each variable-length datum has its length embedded in it,
so that a single pointer carries enough information for any called
function to work with the value.  If you remove the length word
and expect the length to be computed by subtracting two offsets that
are not even physically adjacent to the datum, that stops working.
There is no fix for that that doesn't add performance costs and
complexity.
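
Point 2 is baked into every varlena access; a minimal illustration
(d is assumed to be a Datum holding a text value):

    text   *t   = DatumGetTextPP(d);      /* the pointer alone suffices */
    int     len = VARSIZE_ANY_EXHDR(t);   /* length lives in the datum header */
    char   *dat = VARDATA_ANY(t);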

Practically speaking, even if we were willing to lose on-disk database
compatibility, point 2 breaks so many internal and extension APIs that
there's no chance whatever that we could remove the length-word datum
headers.  That means that the added fields in tuple headers would be
pure added space with no offsetting savings in the data size, making
point 1 quite a lot worse.

regards, tom lane




Re: patch to allow disable of WAL recycling

2019-04-02 Thread Jerry Jelinek
On Mon, Apr 1, 2019 at 7:48 PM Thomas Munro  wrote:

> On Sat, Mar 30, 2019 at 11:18 AM Jerry Jelinek 
> wrote:
> > I went through your new version of the patch and it all looks great to
> me.
>
> I moved the error handling logic around a bit so we'd capture errno
> immediately after the syscalls.  I also made a couple of further
> tweaks to comments and removed some unnecessary casts.
>
> I suspect we took so long on this because of lack of ZFS knowledge and
> uncertainty about the precise reason for the current coding in terms
> of crash safety in general.  After learning more, I now suspect the
> claim about fdatasync(2) and indirect blocks in the comments may be
> incorrect (it may stem from buggy behaviour on older Linux kernels),
> but I'm not sure and it's not this patch's job to change that.
>
> Pushed.  Thanks for the patch!


 Thanks a lot for getting this integrated, and thanks to everyone else who
gave me so much feedback and assistance during this process!
Jerry


Re: partitioned tables referenced by FKs

2019-04-02 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Mar-29, Robert Haas wrote:
>> I don't know why dependency.c doesn't handle this internally.  If I
>> say that I want to delete a list of objects, some of which can't be
>> dropped without dropping certain other things, dependency.c ought to
>> be able to suspend judgement on what the problems are until it's
>> looked over that whole list.  It seems to me that we've had to fight
>> with this problem before, and I don't know why every individual bit of
>> code that calls into that file has to be responsible for working
>> around it individually.

> Hmm, if you think this kind of situation is more widespread than this
> particular case, then maybe we can add the hooks I was talking about,
> and simplify those other cases while fixing this problem.  Do you have
> it in your mind what cases are those?  Nothing comes to mind, but I'll
> search around and see if I can find anything.

FWIW, I think that the dependency mechanism is designed around the idea
that whether it's okay to drop a *database object* depends only on what
other *database objects* rely on it, and that you can always make a DROP
valid by dropping all the dependent objects.  That's not an unreasonable
design assumption, considering that the SQL standard embeds the same
assumption in its RESTRICT/CASCADE syntax.

The problem this patch is running into is that we'd like to make the
validity of dropping a table partition depend on whether there are
individual rows in some other table that FK-reference rows in the target
partition.  That's just completely outside the object-dependency model,
and I'm not sure why Robert finds that surprising.

I think the stopgap solution of requiring a separate DETACH step first
is probably fine, although I don't see any documentation explaining that
in the v10 patch?  Also the test cases in the v10 patch don't seem to
square with that reality, or at least their comments need work.

In the long run I wouldn't necessarily object to having some sort of
hooks in dependency.c to allow this type of use-case to be addressed.
But it'd likely be a good idea to have at least one other concrete
example in mind before trying to design a mechanism for that.

regards, tom lane




Re: "could not reattach to shared memory" on buildfarm member dory

2019-04-02 Thread Tom Lane
Noah Misch  writes:
> I can reproduce the 4 MiB allocations described
> in https://postgr.es/m/29823.1525132...@sss.pgh.pa.us; a few times per
> "vcregress check", they emerge in the middle of PGSharedMemoryReAttach().
> The 4 MiB allocations are stacks for new threads of the default thread
> pool[1].

Hah!  It is great to finally have an understanding of what is happening
here.

I worry that your proposed fix is unstable, in particular this assumption
seems shaky:

> + * ... The idea is that, if the allocator handed out
> + * REGION1 pages before REGION2 pages at one occasion, it will do so whenever
> + * both regions are free.

I wonder whether it's possible to address this by configuring the "default
thread pool" to have only one thread?  It seems like the extra threads are
just adding backend startup overhead to no benefit, since we won't use 'em.

regards, tom lane




Re: Progress reporting for pg_verify_checksums

2019-04-02 Thread Michael Banck
Hi,

Am Dienstag, den 02.04.2019, 15:56 +0900 schrieb Michael Paquier:
> Regarding all this tooling around checksums.  With v12, enabling
> checksums with no actual downtime is doable with a primary-standby
> deployment using physical replication and one planned failover

Can you explain in more detail how this would work? I thought we came to
the conclusion (and the documentation seems to indicate so), that you
should stop all participating instances of a cluster and then enable
checksums on all of them, which would impose a downtime.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz




Re: "could not reattach to shared memory" on buildfarm member dory

2019-04-02 Thread Noah Misch
On Sun, Dec 02, 2018 at 09:35:06PM -0800, Noah Misch wrote:
> On Tue, Sep 25, 2018 at 08:05:12AM -0700, Noah Misch wrote:
> > On Mon, Sep 24, 2018 at 01:53:05PM -0400, Tom Lane wrote:
> > > Overall, I agree that neither of these approaches are exactly attractive.
> > > We're paying a heck of a lot of performance or complexity to solve a
> > > problem that shouldn't even be there, and that we don't understand well.
> > > In particular, the theory that some privileged code is injecting a thread
> > > into every new process doesn't square with my results at
> > > https://www.postgresql.org/message-id/15345.1525145612%40sss.pgh.pa.us
> > > 
> > > I think our best course of action at this point is to do nothing until
> > > we have a clearer understanding of what's actually happening on dory.
> > > Perhaps such understanding will yield an idea for a less painful fix.
> 
> Could one of you having a dory login use
> https://live.sysinternals.com/Procmon.exe to capture process events during
> backend startup?  The ideal would be one capture where startup failed reattach
> and another where it succeeded, but having the successful run alone would be a
> good start.

Joseph Ayers provided, off-list, the capture from a successful startup.  It
wasn't materially different from the one my system generates, so I abandoned
that line of inquiry.  Having explored other aspects of the problem, I expect
the attached fix will work.  I can reproduce the 4 MiB allocations described
in https://postgr.es/m/29823.1525132...@sss.pgh.pa.us; a few times per
"vcregress check", they emerge in the middle of PGSharedMemoryReAttach().  On
my system, there's 5.7 MiB of free address space just before UsedShmemSegAddr,
so the 4 MiB allocation fits in there, and PGSharedMemoryReAttach() does not
fail.  Still, it's easy to imagine that boring variations between environments
could surface dory's problem by reducing that free 5.7 MiB to, say, 3.9 MiB.

The 4 MiB allocations are stacks for new threads of the default thread
pool[1].  (I confirmed that by observing their size change when I changed
StackReserveSize in MSBuildProject.pm and by checking all stack pointers with
"thread apply all info frame" in gdb.)  The API calls in
PGSharedMemoryReAttach() don't cause the thread creation; it's a timing
coincidence.  Commit 2307868 would have worked around the problem, but
pg_usleep() is essentially a no-op on Windows before
pgwin32_signal_initialize() runs.  (I'll push Assert(pgwin32_signal_event) to
some functions.)  While one fix is to block until all expected threads have
started, that could be notably slow, and I don't know how to implement it
cleanly.  I think a better fix is to arrange for the system to prefer a
different address space region for these thread stacks; for details, see the
first comment the patch adds to win32_shmem.c.  This works here.
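
In outline, the idea is something like this (a rough sketch of the
approach only, not the patch's actual code; PROTECTIVE_SIZE and hmap are
placeholders):

    /* reserve a decoy region before touching the shmem region */
    void *protect = VirtualAlloc(NULL, PROTECTIVE_SIZE,
                                 MEM_RESERVE, PAGE_NOACCESS);

    /* release the old mapping, then the decoy */
    VirtualFree(UsedShmemSegAddr, 0, MEM_RELEASE);
    if (protect)
        VirtualFree(protect, 0, MEM_RELEASE);

    /* a racing thread-stack allocation should now prefer the decoy's
     * address range, leaving UsedShmemSegAddr free for: */
    MapViewOfFileEx(hmap, FILE_MAP_ALL_ACCESS, 0, 0, 0, UsedShmemSegAddr);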

> backend startup sees six thread creations:
> 
> 1. main thread
> 2. thread created before postgres.exe has control
> 3. thread created before postgres.exe has control
> 4. thread created before postgres.exe has control
> 5. in pgwin32_signal_initialize()
> 6. in src\backend\port\win32\timer.c:setitimer()
> 
> Threads 2-4 exit exactly 30s after creation.  If we fail to reattach to shared
> memory, we'll exit before reaching code to start 5 or 6.

Threads 2-4 proved to be worker threads of the default thread pool.

[1] https://docs.microsoft.com/en-us/windows/desktop/ProcThread/thread-pools
diff --git a/src/backend/port/win32/signal.c b/src/backend/port/win32/signal.c
index dd91fe5..7d8961f 100644
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index bc1e946..eed17c5 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -17,6 +17,28 @@
 #include "storage/ipc.h"
 #include "storage/pg_shmem.h"
 
+/*
+ * Early in a process's life, Windows asynchronously creates threads for the
+ * process's "default thread pool"
+ * (https://docs.microsoft.com/en-us/windows/desktop/ProcThread/thread-pools).
+ * Occasionally, thread creation allocates a stack after
+ * PGSharedMemoryReAttach() has released UsedShmemSegAddr and before it has
+ * mapped shared memory at UsedShmemSegAddr.  This would cause mapping to fail
+ * if the allocator preferred the just-released region for allocating the new
+ * thread stack.  We observed such failures in some Windows Server 2016
+ * configurations.  To give the system another region to prefer, reserve and
+ * release an additional, protective region immediately before reserving or
+ * releasing shared memory.  The idea is that, if the allocator handed out
+ * REGION1 pages before REGION2 pages at one occasion, it will do so whenever
+ * both regions are free.  Windows Server 2016 exhibits that behavior, and a
+ * system behaving differently would have less need to protect
+ * UsedShmemSegAddr.  The protective region must be at least large enough for
+ * one thread stack.  However, ten times as much is less than 2% of the 

Re: partitioned tables referenced by FKs

2019-04-02 Thread Jesper Pedersen

Hi,

On 4/1/19 4:03 PM, Alvaro Herrera wrote:

I'm satisfied with this patch now, so I intend to push early tomorrow.



Passes check-world, and my tests.

Best regards,
 Jesper




Re: [HACKERS] generated columns

2019-04-02 Thread Erik Rijkers

On 2019-04-02 14:43, Peter Eisentraut wrote:

On 2019-04-01 10:52, Peter Eisentraut wrote:

On 2019-03-31 05:49, Erik Rijkers wrote:

STORED in a
file_fdw foreign table still silently creates the column which then
turns out to be useless on SELECT, with an error like:

"ERROR:  column some_column_name is a generated column
DETAIL:  Generated columns cannot be used in COPY."

Maybe it'd be possible to get an error earlier, i.e., while trying to
create such a useless column?


I'll look into it.


I've been trying to create a test case for file_fdw for this, but I'm
not getting your result.  Can you send a complete test case?



Ah, I had not noticed before: with an asterisk ('select * from table')
one gets no error, just empty values.


An actual error seems to occur when one mentions the 
generated-column-name explicitly in the select-list.


select "id", "Ratio Log2 GEN" from ;
"
ERROR:  column "Ratio Log2 GEN" is a generated column
DETAIL:  Generated columns cannot be used in COPY.
"

That's from a quick test here at work; maybe that gives you enough info.

If that doesn't make it repeatable (for you) I'll make a more complete 
example this evening (from home).






Re: [PROPOSAL] Temporal query processing with range types

2019-04-02 Thread Ibrar Ahmed
I started looking at the patch; there are a couple of problems with it. The
first one is an OID conflict, which I fixed on my machine. The second problem
is an assertion failure. I think you have not compiled the PostgreSQL code
with assertions enabled.

...
postgres=# SELECT *
FROM (projects p1 NORMALIZE projects p2 USING() WITH(t,t)) p_adjusted;
TRAP: FailedAssertion("!(ptr == ((void *)0) || (((const Node*)(ptr))->type) == 
type)", File: "../../../src/include/nodes/nodes.h", Line: 588)
psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2019-04-02 
12:50:09.654 UTC [27550] LOG:  server process (PID 27559) was terminated by 
signal 6: Aborted
...

Although this patch is WIP, please avoid mixed declarations, to silence the
compiler warning:

...
joinpath.c:993:3: warning: ISO C90 forbids mixed declarations and code 
[-Wdeclaration-after-statement]
   PathTarget *target_split = makeNode(PathTarget);
...

I am still looking at the patch.

Re: [HACKERS] generated columns

2019-04-02 Thread Peter Eisentraut
On 2019-04-01 10:52, Peter Eisentraut wrote:
> On 2019-03-31 05:49, Erik Rijkers wrote:
>> STORED in a 
>> file_fdw foreign table still silently creates the column which then 
>> turns out to be useless on SELECT, with an error like:
>>
>> "ERROR:  column some_column_name is a generated column
>> DETAIL:  Generated columns cannot be used in COPY."
>>
>> Maybe it'd be possible to get an error earlier, i.e., while trying to 
>> create such a useless column?
> 
> I'll look into it.

I've been trying to create a test case for file_fdw for this, but I'm
not getting your result.  Can you send a complete test case?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Ordered Partitioned Table Scans

2019-04-02 Thread Amit Langote
Hi David,

On Tue, Apr 2, 2019 at 8:49 PM David Rowley
 wrote:
> I ended up rewording the entire thing and working on the header
> comment for the function too. I think previously it wasn't that well
> defined what "ordered" meant. I added a mention that we expect that
> NULLs, if possible, must come in the last partition.

Thanks for the updated patch.

New descriptions look good, although I was amused by this:

diff --git a/src/backend/partitioning/partbounds.c
b/src/backend/partitioning/partbounds.c
index bdd0d23854..9dd378d7a0 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -25,6 +25,7 @@
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
...
+partitions_are_ordered(struct RelOptInfo *partrel)

Maybe, "struct" is unnecessary?

Thanks,
Amit




Re: Unified logging system for command-line programs

2019-04-02 Thread Peter Eisentraut
On 2019-04-02 05:05, Michael Paquier wrote:
> I am a bit disappointed that this stuff is not reducing
> the amount of diff code with ifdef FRONTEND in src/common/.

That wasn't the purpose of the patch.  If you have a concrete proposal
for how to do what you describe, it would surely be welcome.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Unified logging system for command-line programs

2019-04-02 Thread Peter Eisentraut
On 2019-04-01 20:55, Andres Freund wrote:
> As written upthread, I think this should have had a uniform interface
> with elog.h, and probably even share some code between the two.

The purpose of this patch was to consolidate the existing zoo of logging
routines in the frontend programs.  That has meaningful benefits.  There
is hardly any new code; most of the code was just consolidated from
existing scattered code.

If someone wants to take it further and consolidate that with the
backend logging infrastructure, they are free to propose a patch for
consideration.  Surely the present patch can only help, since it already
makes the call sites uniform, which would presumably have to be done
anyway.  However, there is no prototype or even detailed design sketch
let alone someone committing to implement such a project.  So it seems
unreasonable to block other meaningful improvements in adjacent areas in
the meantime.

> This is
> going in the wrong direction, making it harder, not easier, to share
> code between frontend and backend.

I don't think anything has changed in that respect.  If there is reason
to believe that code that uses fprintf() is easier to share with the
backend than alternatives, then nothing is standing in the way of
continuing to use that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Problems with plan estimates in postgres_fdw

2019-04-02 Thread Etsuro Fujita

(2019/03/29 22:18), Etsuro Fujita wrote:

Attached is a
new version. If there are no objections, I'll commit this version.


Pushed after polishing the patchset a bit further: I added an assertion, 
tweaked comments, and did some cleanup.  Thanks for reviewing, Antonin and Jeff!


Best regards,
Etsuro Fujita





Re: Ordered Partitioned Table Scans

2019-04-02 Thread David Rowley
On Tue, 2 Apr 2019 at 14:26, Amit Langote  wrote:
> How about extending the sentence about when the optimization is disabled
> as follows:
>
> "If we find a Datum in a partition that's greater than one previously
> already seen, then values could become out of order and we'd have to
> disable the optimization.  We'd also need to disable the optimization if
> NULL values are interleaved with other Datum values, because the calling
> code expects them to be present in the last partition."
>
> Further, extend the "For now..." sentence as:
>
> "For now, let's just keep it simple and just accept LIST partitioned table
> without a DEFAULT partition where each partition only accepts a single
> Datum or NULL.  It's OK to always accept NULL partition in that case,
> because PartitionBoundInfo lists it as the last partition."

I ended up rewording the entire thing and working on the header
comment for the function too. I think previously it wasn't that well
defined what "ordered" meant. I added a mention that we expect that
NULLs, if possible, must come in the last partition.
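For illustration, a hypothetical table of the accepted shape, i.e. a LIST
partitioned table without a DEFAULT partition where each partition accepts
a single Datum or NULL:

  create table listp (a int) partition by list (a);
  create table listp1 partition of listp for values in (1);
  create table listp2 partition of listp for values in (2);
  create table listpn partition of listp for values in (null);

  -- can then use an ordered Append instead of a MergeAppend, since
  -- PartitionBoundInfo lists the NULL partition last:
  explain (costs off) select * from listp order by a;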

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


mergeappend_to_append_conversion_v18.patch
Description: Binary data


Re: GSoC2019--pgBackRest port to Windows (2019)

2019-04-02 Thread David Steele

Hello Youssef,

On 4/1/19 5:16 PM, Youssef Khedher wrote:

Hello,

I’m Youssef Khedher, a 19-year-old student from Tunisia.

I’m a computer science student in Tunisia, and also an online student
for the Harvard CS50 Program.


I’m interested in the “pgBackRest port to Windows (2019)” project.


Excellent!

> To be frank with you, I’m not an expert in C

You'll need to be able to understand and modify many areas of the C code 
in order to be successful in this project.  I would encourage you to 
review the code and make sure you are able to follow what it's doing in 
general before submitting an application:


https://github.com/pgbackrest/pgbackrest/tree/master/src

I don't think you'll be writing a bunch of new code, but anything is 
possible when porting software.


In order to achieve the project goal, all I need from you, Mr. Stephen 
Frost or Mr. David Steele, is some help to understand the task perfectly 
and to help me enter the world of software development.


For this project it's important to have knowledge about how Windows 
differs from Unix, e.g. the way child processes are spawned.  In the 
core code we only use fork() with an immediate exec(), so this should be 
straightforward enough to port:


https://github.com/pgbackrest/pgbackrest/blob/master/src/command/archive/get/get.c#L235

We also use fork() quite a bit in our testing and I was thinking the 
HARNESS_FORK*() macros could be enhanced to use threads instead on Windows:


https://github.com/pgbackrest/pgbackrest/blob/master/test/src/common/harnessFork.h

If not, then the tests will need to be adjusted to accommodate whatever 
testing method is developed.


pgBackRest uses SSH to communicate with remote processes:

https://github.com/pgbackrest/pgbackrest/blob/master/src/protocol/helper.c#L288

Eventually we would like to move away from requiring SSH, but for this 
port I think the best idea would be to get pgBackRest working with some 
open source SSH solution such as OpenSSH (which is easily installed on 
recent versions of Windows, but not sure about older versions).  If 
there is time at the end we might look at alternate solutions.


There may be other minor areas in the code that need to be adjusted or 
#ifdef'd to work with Windows.  We've tried to keep this to a minimum by 
enforcing C99 and Posix standards, but there will be some differences. 
The config code that enforces Unix path structure is an obvious area 
that will need to be updated:


https://github.com/pgbackrest/pgbackrest/blob/master/src/config/parse.c#L1034

Note that we want to port to native Windows without the presence of 
Cygwin (or similar) in production.  My preference would be to use 
something like Strawberry Perl for testing, and then as few dependencies 
as possible for the production distribution.


A CI testing platform for Windows will need to be selected -- most 
likely AppVeyor.


The documentation will also need to be updated for Windows.

You should delve into the areas mentioned above and propose possible 
solutions when writing your proposal.  Feel free to ask questions.


Porting code from one platform to another can be quite complicated, but 
we believe this project can be accomplished over the summer by a skilled 
and motivated student.


If you are interested in proceeding, you should create an issue here:

https://github.com/pgbackrest/pgbackrest/issues

We do our development on Github and issues are the way we discuss 
projects, enhancements, and bugs.


Good luck!
--
-David
da...@pgmasters.net




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-04-02 Thread Kyotaro HORIGUCHI
Thank you for reviewing.

At Sun, 31 Mar 2019 15:31:58 -0700, Noah Misch  wrote in 
<20190331223158.gb891...@rfd.leadboat.com>
> On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> > On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > +/*
> > > + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> > > + */
> > > +void
> > > +RelationRemovePendingSync(Relation rel)
> > 
> > What is the coding rule for deciding when to call this?  Currently, only
> > ATExecSetTableSpace() calls this.  CLUSTER doesn't call it, despite behaving
> > much like ALTER TABLE SET TABLESPACE behaves.
> 
> This question still applies.  (The function name did change from
> RelationRemovePendingSync() to RelationInvalidateWALRequirements().)

It is called for heap_register_sync()'ed relations to avoid useless
syncing, or trying to sync nonexistent files. I modified CLUSTER,
COPY FROM, CREATE AS, REFRESH MATVIEW and SET TABLESPACE so that they
all use the function. (The function is renamed to
table_relation_invalidate_walskip().)
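For context, the WAL-skip tracking discussed here serves cases like the
following under wal_level = minimal, where a relation created (or
truncated) in the same transaction is loaded and then synced at commit
instead of being WAL-logged; the file path is illustrative:

  begin;
  create table t (a int, b text);
  copy t from '/tmp/data.csv' (format csv);  -- WAL for the data can be skipped
  commit;                                    -- the relation file is synced here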

I noticed that heap_register_sync and friends are now a kind of
table-AM function. So I added .relation_register_walskip and
.relation_invalidate_walskip to TableAMRoutine and moved the
heap_register_sync stuff to heapam_relation_register_walskip and
friends. .finish_bulk_insert() is modified to be used only when
WAL-skip is active on the relation. (0004, 0005) But I'm not sure
that is the right direction.

(RelWALRequirements is renamed to RelWALSkip)

The change makes smgrFinishBulkInsert (formerly smgrDoPendingSync) need
to call a table-AM interface. A Relation is required to call it in the
designed way, but the relcache entry cannot live until then. In the
attached patch 0005, a new member TableAmRoutine *tableam is added to
RelWalSkip, and finish_bulk_insert() is called via that. But I'm quite
uneasy with that...

> On Mon, Mar 25, 2019 at 09:32:04PM +0900, Kyotaro HORIGUCHI wrote:
> > At Wed, 20 Mar 2019 22:48:35 -0700, Noah Misch  wrote in 
> > <20190321054835.gb3842...@rfd.leadboat.com>
> Again, I do want them in the code.  Please restore them, but use a mechanism
> like CACHE_elog() so they're built only if one defines a preprocessor symbol.

Ah, sorry. I restored the messages using STORAGE_elog(). I also
needed this. (SMGR_ might be better but I'm not sure.)

> On Tue, Mar 26, 2019 at 04:35:07PM +0900, Kyotaro HORIGUCHI wrote:
> > +   smgrProcessWALRequirementInval(s->subTransactionId, false);
> 
> The smgrProcessWALRequirementInval() calls almost certainly belong in
> CommitSubTransaction() and AbortSubTransaction(), not in these functions.  By
> doing it here, you'd get the wrong behavior in a subtransaction created via a
> plpgsql "BEGIN ... EXCEPTION WHEN OTHERS THEN" block.

Thanks. Moved it to AtSubAbort_smgr() and AtSubCommit_smgr(). (0005)

> > +/*
> > + * Process pending invalidation of WAL requirements happened in the
> > + * subtransaction
> > + */
> > +void
> > +smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit)
> > +{
> > +   HASH_SEQ_STATUS status;
> > +   RelWalRequirement *walreq;
> > +
> > +   if (!walRequirements)
> > +   return;
> > +
> > +   /* We expect that we don't have walRequirements in almost all cases */
> > +   hash_seq_init(, walRequirements);
> > +
> > +   while ((walreq = hash_seq_search()) != NULL)
> > +   {
> > +   /* remove useless entry */
> > +   if (isCommit ?
> > +   walreq->invalidate_sxid == sxid :
> > +   walreq->create_sxid == sxid)
> > +   hash_search(walRequirements, >relnode, 
> > HASH_REMOVE, NULL);
> 
> Do not remove entries during subtransaction commit, because a parent
> subtransaction might still abort.  See other CommitSubTransaction() callees
> for examples of correct subtransaction handling.  AtEOSubXact_Files() is one
> simple example.

Thanks. smgrProcessWALSkipInval() (0005) is changed so that:

 - If a RelWalSkip entry is created in aborted subtransaction,
   remove it.

 - If a RelWalSkip entry is created then invalidated in committed
   subtransaction, remove it.

 - If a RelWalSkip entry is created and committed, change the
   creator subtransaction to the parent subtransaction.

 - If a RelWalSkip entry is created elsewhere and invalidated in a
   committed subtransaction, move the invalidation to the parent
   subtransaction.

 - If a RelWalSkip entry is created elsewhere and invalidated in
   aborted subtransaction, cancel the invalidation.

Tests are added as test3a2 and test3a3. (0001)

> > @@ -3567,15 +3602,26 @@ heap_update
> >  */
> > if (RelationIsAccessibleInLogicalDecoding(relation))
> > {
> > -   log_heap_new_cid(relation, );
> > -   log_heap_new_cid(relation, heaptup);
> > +   if (oldbuf_needs_wal)
> > +   log_heap_new_cid(relation, );
> > +   if (newbuf_needs_wal)
> > +   

Re: Refactoring the checkpointer's fsync request queue

2019-04-02 Thread Thomas Munro
On Wed, Mar 27, 2019 at 5:48 AM Shawn Debnath  wrote:
> On Tue, Mar 26, 2019 at 09:22:56AM -0400, Robert Haas wrote:
> > On Tue, Mar 26, 2019 at 12:20 AM Thomas Munro  
> > wrote:
> > > Here's a fixup patch.  0001 is the same as Shawn's v12 patch, and 0002
> > > has my proposed changes to switch to callbacks that actually perform
> > > the sync and unlink operations given a file tag, and do so via the
> > > SMGR fd cache, rather than exposing the path to sync.c.  This moves us
> > > back towards the way I had it in an earlier version of the patch, but
> > > instead of using smgrsyncimmed() as I had it, it goes via Shawn's new
> > > sync handler function lookup table, allowing for non-smgr components
> > > to use this machinery in future (as requested by Andres).
> >
> > Strong +1.  Not only might closing and reopening the files have
> > performance and reliability implications, but a future smgr might talk
> > to the network, having no local file to sync.
>
> Makes sense for now. When we re-visit the fd-passing or sync-on-close
> implementations, we can adapt the changes relatively easily given the
> rest of the framework is staying intact. I am hoping these patches do
> not delay the last fsync-gate issue discussion further.

I found a few more things that I thought needed adjustment:

* Packing handler and request type into a uint8 is cute but a waste of
time if we're just going to put it in a struct next to a member that
requires word-alignment.  So I changed it to a pair of plain old int16
members.  The ftag member starts at offset 4 either way, on my system.

* I didn't really like the use of the word HIERARCHY in the name of
the request type, and changed it to SYNC_FILTER_REQUEST.  That word
came up because we were implementing a kind of hierarchy, where if you
drop a database you want to forget things for all segments inside all
relations inside that database, but the whole point of this new API is
that it doesn't understand that, it calls a filter function to decide
which requests to keep.  So I preferred "filter" as a name for the
type of request.

* I simplified the "matches" callback interface.

* Various typos and comment clean-up.

I'm going to do some more testing and tidying tomorrow (for example I
think the segment.h header is silly and I'd like to remove that), and
commit this.

-- 
Thomas Munro
https://enterprisedb.com


0001-Refactor-the-fsync-queue-for-future-SMGR-impleme-v14.patch
Description: Binary data


Re: Minimal logical decoding on standbys

2019-04-02 Thread Amit Khandekar
On Thu, 14 Mar 2019 at 15:00, Amit Khandekar  wrote:
> I managed to get a recovery conflict by:
> 1. Setting hot_standby_feedback to off
> 2. Creating a logical replication slot on standby
> 3. Creating a table on master, and inserting some data
> 4. Running: VACUUM FULL;
>
> This gives WARNING messages in the standby log file.
> 2019-03-14 14:57:56.833 IST [40076] WARNING:  slot decoding_standby w/
> catalog xmin 474 conflicts with removed xid 477
> 2019-03-14 14:57:56.833 IST [40076] CONTEXT:  WAL redo at 0/3069E98
> for Heap2/CLEAN: remxid 477
>
> But I did not add such a testcase into the test file, because with the
> current patch, it does not do anything with the slot; it just keeps on
> emitting WARNING in the log file; so we can't test this scenario as of
> now using the tap test.

I am going ahead with the drop-the-slot way of handling the recovery
conflict. I am trying out using ReplicationSlotDropPtr() to drop the
slot. It seems the required locks are already in place inside the for
loop of ResolveRecoveryConflictWithSlots(), so we can directly call
ReplicationSlotDropPtr() when a slot xmin conflict is found.

As explained above, the only way I could reproduce the conflict is by
turning hot_standby_feedback off on the slave, creating and inserting
into a table on the master, and then running VACUUM FULL. But after
doing this, I am not able to verify whether the slot is dropped,
because on the slave any simple psql command thereafter waits on a lock
acquired on a system catalog, e.g. pg_authid. Working on it.
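For reference, the repro steps above amount to roughly the following
(slot, plugin, and table names are illustrative):

  -- on the standby:
  alter system set hot_standby_feedback = off;   -- plus a reload
  select pg_create_logical_replication_slot('decoding_standby',
                                            'test_decoding');
  -- on the master:
  create table t (a int);
  insert into t values (1);
  vacuum full;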


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: minimizing pg_stat_statements performance overhead

2019-04-02 Thread Christoph Berg
Re: Raymond Martin 2019-04-01 

> Thanks again Fabien. I am attaching the patch to this email in the hope of 
> getting it approved during the next commit fest. 

Raymond,

you sent the patch as UTF-16, could you re-send it as plain ascii?

Christoph




Re: [Patch] Mingw: Fix import library extension, build actual static libraries

2019-04-02 Thread Sandro Mani
Any chance these patches could be considered?

Thanks
Sandro

On Thu, Mar 7, 2019 at 3:23 PM Sandro Mani  wrote:

> Currently building postgresql for Win32 with a mingw toolchain produces
> import libraries with *.a extension, whereas the extension should be
> *.dll.a. There are various downstream workarounds for this, see i.e. [1]
> and [2]. The attached patch 0001-Fix-import-library-extension.patch
> addresses this.
>
> Related, no actual static libraries are produced alongside the respective
> dlls. The attached patch 0002-Build-static-libraries.patch addresses this,
> in a similar fashion as is already done for the AIX case in Makefile.shlib.
>
> Thanks Sandro
>
> [1]
> https://src.fedoraproject.org/rpms/mingw-postgresql/blob/master/f/mingw-postgresql.spec#_144
> [2]
> https://aur.archlinux.org/cgit/aur.git/tree/0001-Use-.dll.a-as-extension-for-import-libraries.patch?h=mingw-w64-postgresql
>


Re: partitioned tables referenced by FKs

2019-04-02 Thread Amit Langote
Hi Alvaro,

On 2019/04/02 5:03, Alvaro Herrera wrote:
> On 2019-Mar-29, Jesper Pedersen wrote:
> 
>> I ran my test cases for this feature, and haven't seen any issues.
>>
>> Therefore I'm marking 1877 as Ready for Committer. If others have additional
>> feedback feel free to switch it back.
> 
> Thanks!
> 
> I found two issues today.  One, server side, is that during cloning for
> partition attach we were not checking for concurrent deletion of
> referenced tuples in partitions.  I added an isolation spec test for
> this.  To fix the bug, added a find_all_inheritors() to lock all
> partitions with ShareRowExclusiveLock.
> 
> Another is that psql's \d failed for versions < 12, because we were
> unconditionally adding an "AND conparentid = 0" clause.
> 
> I also reworked CloneForeignKeyConstraints.  The previous style was
> being forced by the old recursing method; now we can make it a lot
> simpler -- it's now just two subroutine calls.
> 
> I'm satisfied with this patch now, so I intend to push early tomorrow.

I tried the latest patch and found no problems as far as I tried.

I guess it's OK for the time being that users will not be able to drop a
partition while a foreign key is pointing to it, given the troubles with
that pre-DROP check.  (Of course, someone may want to improve dependency.c
in the future so that we can allow the DROP where possible.)

Thanks for working on this.

Off-topic:

I did some performance testing, which has no relation to the code being
changed by this patch, but thought the numbers hint at a need for
improving other areas that affect the scalability (in terms of number of
partitions) of the foreign key checks.

* Table setup:

Create a hash partitioned table with N partitions, followed by a regular
table referencing the aforementioned partitioned table.  Insert keys
1-1000 into ht, the primary key table.

N: 2..8192

drop table ht cascade;
create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values with
(MODULUS N, REMAINDER ' || (x)::text || ');' from generate_series(0, N-1) x;
\gexec
drop table foo cascade;
create table foo (a int references ht);
truncate ht cascade;
insert into ht select generate_series(1, 1000);

* pgbench script (insert-foo.sql)

\set a random(1, 1000)
insert into foo values (:a);

* Run pgbench

pgbench -n -T 120 -f insert-foo.sql

* TPS with plan_cache_mode = auto (default) or force_custom_plan

N      TPS
2      2382.779742 <=
8      2392.514952
32     2392.682178
128    2387.828361
512    2358.325865
1024   2234.699889
4096   2030.610062
8192   1740.633117

* TPS with plan_cache_mode = force_generic_plan

N      TPS
2      2924.030727 <=
8      2854.460937
32     2378.630989
128    1550.235166
512    642.980148
1024   330.142430
4096   71.739272
8192   32.620403

It'd be good to work on improving the scalability of generic plan
execution in the future, because not having to re-plan the query used by
RI_FKey_check would always be better, as seen by comparing execution times
for 2 partitions (2382 tps with a custom plan vs. 2924 tps with a generic
plan).  Having run-time pruning already helps quite a lot, but some other
ideas are needed, such as the one proposed by David recently, but withdrawn
for PG 12 [1].
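For reference, the query that RI_FKey_check plans and executes against the
referenced table is roughly of this shape (simplified):

  SELECT 1 FROM ht x WHERE a = $1 FOR KEY SHARE OF x;

so the cost of re-planning it (custom plans), or of pruning it at
execution time (generic plans), is paid on inserts into the referencing
table.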

Anyway, nothing here that prevents this patch from being committed. :)

Thanks for reading.

Regards,
Amit

[1] https://commitfest.postgresql.org/22/1897/





Re: Progress reporting for pg_verify_checksums

2019-04-02 Thread Fabien COELHO




For pg_checksums, probably some improvement patch will be submitted later,
if someone feels like it.


Let's see.  I think that what we have now in v12 is good enough for
checksum operations on an offline cluster.  And my take is that we
should focus more on online checksum verification for v13 and future
versions.


I agree.

For online checking, we would want throttling so that the check can have a 
reduced performance impact when scrubbing.


--
Fabien.




RE: Transaction commits VS Transaction commits (with parallel) VS query mean time

2019-04-02 Thread Jamison, Kirk
On Thursday, March 28, 2019 3:13 PM (GMT+9), Haribabu Kommi wrote:
> I tried the approach you suggested: not counting the actual parallel-work
> transactions, just releasing the resources without touching the counters,
> but the counts are not reduced by much.
>
> HEAD - with 4 parallel workers, running the query generates 13 stats (4 * 3 + 1)
> Patch - with 4 parallel workers, running the query generates 9 stats (4 * 2 + 1)
> Old approach patch - with 4 parallel workers, running the query generates 1 stat (1)
>
> Currently the parallel worker starts a transaction 3 times, in the
> following places:
> 1. InitPostgres
> 2. ParallelWorkerMain (2)
>
> with the attached patch, we reduce one count from ParallelWorkerMain.

I'm sorry for the late review of the patch.
The patch applies and compiles cleanly, and make check passes too.
I tested the recent patch again using the same method above and
confirmed that the generated stats increase by 9 with 4 parallel workers.

postgres=# BEGIN;
postgres=# select xact_commit from pg_stat_database where datname = 'postgres';
 xact_commit
-------------
          60
(1 row)
postgres=# explain analyze select * from tab where b = 6;
[snipped]
postgres=# COMMIT;
postgres=# select xact_commit from pg_stat_database where datname = 'postgres';
 xact_commit
-------------
          69
(1 row)

The intention of the latest patch is to fix the stats for (IOW, not
count) parallel cooperating transactions, i.e. the transactions
started by StartParallelWorkerTransaction, based on Amit's advice.
After testing, the current patch works as intended.

However, the later mails also discuss how parallel xact stats should
be shown. The eventual goal is to improve how we present the stats
for parallel workers by differentiating between internally-initiated
transactions (bgworker xacts, autovacuum, etc.) and
user/application-initiated transactions, apart from keeping the
overall xact stats.
...So will fixing that one change this thread's fix?

I personally have no problem with committing this fix first, before
fixing the other problem discussed above.
But I think there should be a consensus on that.

Regards,
Kirk Jamison


GSoC2019--pgBackRest port to Windows (2019)

2019-04-02 Thread Youssef Khedher
Hello,
I’m Youssef Khedher, a 19-year-old student from Tunisia.
I’m a computer science student in Tunisia, and also an online student for 
the Harvard CS50 Program.
I’m interested in the “pgBackRest port to Windows (2019)” project.
I have good knowledge of C and of IT in general. I believe that nothing is 
impossible and that I can handle and understand this project 
even if I have to work 24h/24h. I love challenges, and I’m someone who 
never stops, no matter what the obstacle is.
Working on an open source project with a well-known organization like 
PostgreSQL is a pleasure and a dream for every young 
student. That’s why I’m fully motivated to take on the challenge with you. 
To be frank with you, I’m not an expert in C, but I’m the kind of person 
who never gives up until the problem is solved on my own.
In order to achieve the project goal, all I need from you, Mr. Stephen 
Frost or Mr. David Steele, is some help to understand the task perfectly 
and to help me enter the world of software development. 
I feel comfortable expressing what’s in my head honestly because 
PostgreSQL said that their mentors are wonderfully kind and friendly 
people who want you to learn and succeed!
Thanks for your time. 
Looking forward to your response!

Sincerely,
 " YOUSSEF  KHEDHER "

Computer-Science Student. 
Phone:(+216) 25 460 276 
Address : 28 Rue Amir Abedlkader, Jemmal 5020, Monastir, Tunisia.




RE: Planning counters in pg_stat_statements (using pgss_store)

2019-04-02 Thread legrand legrand
Hi,

>>
>> case  avg_tps  pct_diff
>> 0     89 278   --
>> 1     88 745   0,6%
>> 2     88 282   1,1%
>> 3     86 660   2,9%
>>
>> This means that even in this extreme test case, the worst degradation is less
>> than 3%
>> (this overhead can be removed using the pg_stat_statements.track_planning GUC).

> Is the difference between 2 and 3 the extraneous pgss_store call to
> always store the query text if planner hook doesn't have access to the
> query text?

Yes it is,
but I agree it seems a big gap (1,8%) compared to the difference between 1 and 
2 (0,5%).
Maybe this is just measurement "noise"...
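For what it's worth, with the patch the new counters would be consulted
with something like this (column names as proposed by the patch, and
subject to change):

  select query, plans, total_plan_time, calls, total_exec_time
  from pg_stat_statements
  order by total_plan_time desc
  limit 5;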

Regards
PAscal


Re: New vacuum option to do only freezing

2019-04-02 Thread Darafei Praliaskouski
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

I have read this patch. I like the concept and would like it to get committed.

Question I have after reading the patch is around this construct:

 /*
- * If there are no indexes then we can vacuum the page right now
- * instead of doing a second scan.
+ * If there are no indexes we can vacuum the page right now instead of
+ * doing a second scan. Also we don't do that but forget dead tuples
+ * when index cleanup is disabled.
  */

This seems to change behavior on heap tuples, even though the option itself 
is documented to be about "Indexes" only. This needs either a better 
explanation of what "forget dead tuples" means, and confirmation that it 
does not lead to some kind of internal inconsistency, or documentation of 
the effect on heap tuples.

This same block raises the question: "after I enable this option, do a 
vacuum, and decide I don't like it, what do I need to do to disable it 
again?" Just set the option back, or set it and perform a vacuum, or set it 
and perform a VACUUM FULL, since something was "forgotten"?

It may be that this concept of "forgetting" is documented somewhere in the 
nearby comments, but I'd prefer it to be stated explicitly.
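For context, the option under review would be exercised along these lines
(assuming the INDEX_CLEANUP syntax from the patch; the table name is made
up):

  -- freeze tuples but skip index cleanup; dead tuples are retained
  -- ("forgotten") rather than removed:
  vacuum (freeze, index_cleanup false) some_table;

  -- a later vacuum with index cleanup enabled can reclaim them:
  vacuum (index_cleanup true) some_table;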

The new status of this patch is: Waiting on Author


Re: Progress reporting for pg_verify_checksums

2019-04-02 Thread Michael Paquier
On Tue, Apr 02, 2019 at 08:11:45AM +0200, Fabien COELHO wrote:
> Nope, this was way before I intervened. ISTM that a patch was submitted to
> get per-second or slower progress reporting on the initialization, and it was
> rejected. Now that there are many SSDs, maybe I could bring it back. An issue
> with pgbench is that the reported time and progress is for the insertion
> phase only, but the PK and FK declarations take as much time and are not
> included, so I'm not sure it can be much better.

Bonus idea: integrate automatically with pg_stat_progress_vacuum() for
the last vacuum at the end of initialization.  This step can also take
a very long time and one can only know the progress of the VACUUM with
a second session/terminal.

> For pg_checksums, probably some improvement patch will be submitted later,
> if someone feels like it.

Let's see.  I think that what we have now in v12 is good enough for
checksum operations on an offline cluster.  And my take is that we
should focus more on online checksum verification for v13 and future
versions.

Regarding all this tooling around checksums.  With v12, enabling
checksums with no actual downtime is doable with a primary-standby
deployment using physical replication and one planned failover
(possible as well with v10 and newer versions and logical replication
but that takes longer), so I think that there is not much value in
being able to enable checksums on a single node while it is online.
--
Michael


signature.asc
Description: PGP signature


Re: Checksum errors in pg_stat_database

2019-04-02 Thread Michael Paquier
On Tue, Apr 02, 2019 at 07:43:12AM +0200, Julien Rouhaud wrote:
> On Tue, Apr 2, 2019 at 6:56 AM Michael Paquier  wrote:
>>  One thing which is not
>> proposed on this patch, and I am fine with it as a first draft, is
>> that we don't have any information about the broken block number and
>> the file involved.  My gut tells me that we'd want a separate view,
>> like pg_stat_checksums_details with one tuple per (dboid, rel, fork,
>> blck) to be complete.  But that's just for future work.
> 
> That could indeed be nice.

Actually, backpedaling on this one...  pg_stat_checksums_details may
be a bad idea as we could finish with one row per broken block.  If
a corruption is spreading quickly, pgstat would not be able to sustain
that number of objects.  Having pg_stat_checksums would allow us to
plug in more data easily, based on the last failure state:
- last relid of failure
- last fork type of failure
- last block number of failure.
Not saying to do that now, but having that in pg_stat_database does
not seem very natural to me.  And on top of that we would have an
extra row full of NULLs for shared objects in pg_stat_database if we
adopt the unique view approach...  I find that rather ugly.
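For illustration, the per-database shape being questioned would be consumed
roughly like this (column names as in the patch under review):

  select datname, checksum_failures, checksum_last_failure
  from pg_stat_database
  where checksum_failures > 0;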

>> No need for two ifs here.  What about just that?
>> if (NULL)
>> PG_RETURN_NULL();
>> else
>> PG_RETURN_TIMESTAMPTZ(last_checksum_failure);
> 
> I do agree, but this is done like this everywhere in pgstatfuncs.c, so
> I think it's better to keep it as-is for consistency.

Okay, this is not an issue for me.

The patch looks fine to me as-is.  Let's see what Magnus or others have to
say about it.  I can take care of this open item if necessary but
that's not my commit so I'd rather not step on Magnus' toes.
--
Michael


signature.asc
Description: PGP signature


Re: Compressed TOAST Slicing

2019-04-02 Thread Komяpa
Hi!


> I'll plan to push this tomorrow with the above change (and a few
> additional comments to explain what all is going on..).


Is everything ok? Can it be pushed?

I'm looking here and haven't found it pushed, which worries me.
https://github.com/postgres/postgres/commits/master



-- 
Darafei Praliaskouski
Support me: http://patreon.com/komzpa


Re: Progress reporting for pg_verify_checksums

2019-04-02 Thread Fabien COELHO


Bonjour Michaël,


Maybe it should be -P X where X is the expected
delay in seconds. Pgbench progress reporting on initialization basically
outputs 10 rows per second, probably it is too much.


I cannot say for pgbench.  I personally think that's a lot but you are
the one who wrote it as such I guess.


Nope, this was way before I intervened. ISTM that a patch was submitted to 
get per-second or slower progress reporting on the initialization, and it 
was rejected. Now that there are many SSDs, maybe I could bring it back. An 
issue with pgbench is that the reported time and progress is for the 
insertion phase only, but the PK and FK declarations take as much time 
and are not included, so I'm not sure it can be much better.


For pg_checksums, probably some improvement patch will be submitted later, 
if someone feels like it.


--
Fabien.

Re: Pluggable Storage - Andres's take

2019-04-02 Thread Haribabu Kommi
On Tue, Apr 2, 2019 at 11:53 AM Andres Freund  wrote:

> Hi,
>
> On 2019-04-02 11:39:57 +1100, Haribabu Kommi wrote:
> > > What made you rename indexam.sgml to am.sgml, instead of creating a
> > > separate tableam.sgml?  Seems easier to just have a separate file?
> > >
> >
> > No specific reason, I just thought of adding all the access methods under
> > one file.
> > I can change it to tableam.sgml.
>
> I'd rather keep it separate. It seems likely that both table and indexam
> docs will grow further over time, and they're not that closely
> related. Additionally not moving sect1->sect2 etc will keep links stable
> (which we could also achieve with different sect1 names, I realize
> that).
>

OK.


> > I can understand your point of avoiding a function-by-function API
> > reference, as the user can check the code comments directly.  Still, I
> > feel some people may refer to the docs for API changes.  I am fine with
> > removing it, based on your opinion.
>
> I think having to keep both tableam.h and the sgml file current is
> too much overhead - and anybody that's going to create a new tableam is
> going to be able to look into the source anyway.
>

Here I have attached updated patches as per the discussion.
Is the description of table access methods enough, or do you want me to
add some more details?
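For reference, the user-facing syntax the docs cover is along these lines
(the custom handler name is made up; heap is the built-in method):

  CREATE ACCESS METHOD my_tableam TYPE TABLE HANDLER my_tableam_handler;
  CREATE TABLE t1 (a int) USING my_tableam;
  CREATE TABLE t2 (a int) USING heap;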

Regards,
Haribabu Kommi
Fujitsu Australia


0001-tableam-doc-update-of-table-access-methods.patch
Description: Binary data


0002-Doc-updates-for-pluggable-table-access-method-syntax.patch
Description: Binary data