Re: DSM segment handle generation in background workers

2018-11-13 Thread Noah Misch
On Wed, Nov 14, 2018 at 08:22:42PM +1300, Thomas Munro wrote:
> On Wed, Nov 14, 2018 at 6:34 PM Noah Misch  wrote:
> > On Wed, Nov 14, 2018 at 05:50:26PM +1300, Thomas Munro wrote:
> > > On Wed, Nov 14, 2018 at 3:24 PM Noah Misch  wrote:
> > > > What counts is the ease of predicting a complete seed.  HEAD's 
> > > > algorithm has
> > > > ~13 trivially-predictable bits, and the algorithm that stood in 
> > > > BackendRun()
> > > > from 98c5065 until 197e4af had no such bits.  You're right that the 
> > > > other 19
> > > > bits are harder to predict than any given 19 bits under the old 
> > > > algorithm, but
> > > > the complete seed remains more predictable than it was before 197e4af.
> > >
> > > However we mix them, given that the source code is well known, isn't
> > > an attacker's job really to predict the time and pid, two not
> > > especially well guarded secrets?
> >
> > True.  Better to frame the issue as uniform distribution of seed, not
> > unpredictability of seed selection.
> 
> What do you think about the attached?

You mentioned that you rewrote the algorithm because the new function had a
TimestampTz.  But the BackendRun() code, which it replaced, also had a
TimestampTz.  You can reuse the exact algorithm.  Is there a reason to change?



Re: Restore CurrentUserId only if 'prevUser' is valid when abort transaction

2018-11-13 Thread Michael Paquier
On Tue, Nov 13, 2018 at 04:08:04PM -0500, Tom Lane wrote:
> Well, if it is an issue then it's an issue for back branches too.
> It's fine, I think, as long as the warning stays a warning ...
> but there are lots of ways in which trying to print a warning
> might turn into an error.

In the end I have agreed with this position, and patched the back-branches
so that they use an assertion instead of the elog().  So committed.
--
Michael




Re: DSM segment handle generation in background workers

2018-11-13 Thread Thomas Munro
On Wed, Nov 14, 2018 at 6:34 PM Noah Misch  wrote:
> On Wed, Nov 14, 2018 at 05:50:26PM +1300, Thomas Munro wrote:
> > On Wed, Nov 14, 2018 at 3:24 PM Noah Misch  wrote:
> > > What counts is the ease of predicting a complete seed.  HEAD's algorithm 
> > > has
> > > ~13 trivially-predictable bits, and the algorithm that stood in 
> > > BackendRun()
> > > from 98c5065 until 197e4af had no such bits.  You're right that the other 
> > > 19
> > > bits are harder to predict than any given 19 bits under the old 
> > > algorithm, but
> > > the complete seed remains more predictable than it was before 197e4af.
> >
> > However we mix them, given that the source code is well known, isn't
> > an attacker's job really to predict the time and pid, two not
> > especially well guarded secrets?
>
> True.  Better to frame the issue as uniform distribution of seed, not
> unpredictability of seed selection.

What do you think about the attached?

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Increase-the-number-of-possible-random-seeds-per-tim.patch
Description: Binary data


Re: Undo logs

2018-11-13 Thread Dilip Kumar
On Mon, Nov 5, 2018 at 5:09 PM Dilip Kumar  wrote:
>
> On Mon, Sep 3, 2018 at 11:26 AM Dilip Kumar  wrote:
>
> Thomas has already posted the latest version of undo log patches on
> 'Cleaning up orphaned files using undo logs' thread[1].  So I have
> rebased the undo-interface patch also.  This patch also includes
> latest defect fixes from the main zheap branch [2].
>
Hi Thomas,

The latest patch for undo log storage does not compile on HEAD; I
think it needs to be rebased due to your commit related to "pg_pread()
and pg_pwrite() for data files and WAL".

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2018-11-13 Thread Peter Geoghegan
On Tue, Nov 13, 2018 at 1:29 PM Peter Geoghegan  wrote:
> A solution does occur to me that I'm kind of embarrassed to suggest,
> but that would nonetheless probably do the job:

> Why not vary the objsubid value among entries that don't use it
> anyway, so that they have negative integer objsubid values rather
> than 0? That would have the effect of forcing the scan order of
> pg_depend_depender_index scans to be whatever we want it to be.

I tried to implement this today, and it almost worked. However, what I
came up with was even more brittle than I thought it would be, because
knowledge of objsubid is all over the place. It's even in pg_dump.

Then I had a better idea: add a monotonically decreasing column to
pg_depend, and make it the last attribute in both
pg_depend_depender_index and pg_depend_reference_index (it has to be
in both indexes). This makes almost all the regression test output go
back to the master branch output, since index-wise duplicates are now
accessed in a well defined order: reverse chronological order (the
reverse of the order of the creation of a pg_depend entry). There is a
tiny amount of stuff that still has a different order in the
regression test output, but the changes are trivial, and obviously
still correct.

Is this approach really so bad? I might want to find a way to make it
a monotonically increasing DESC column (there is a question around bki
grammar support for that), but that's just a detail. Some scheme
involving a new column seems like it will work reasonably well. It's
much more maintainable than anything else I'm likely to come up with.

The indexes are actually smaller with the change following a run of
the regression tests, provided the entire patch series is applied. So
even if the only thing you care about is system catalog size, you
still win (the table is slightly bigger, though). This approach will
fix most or all of the test flappiness that we have been papering over
with "\set VERBOSITY terse" before now. I can't promise that just
making this one pg_depend change will reliably make that class of
problem go away forever, since you could still see something like that
elsewhere, but it looks like the vast majority of problem cases
involve pg_depend. So we might just end up killing the "\set VERBOSITY
terse" hack completely this way.

We're already relying on the scan order being in reverse chronological
order, so we might as well formalize the dependency. I don't think
that it's possible to sort the pg_depend entries as a way of fixing
the breakage while avoiding storing this extra information -- what is
there to sort on that's there already? You'd have to infer a whole
bunch of things about the object types associated with pg_depend
entries to do that, and teach dependency.c about its callers. That
seems pretty brittle to me.

-- 
Peter Geoghegan



Re: Sync ECPG scanner with core

2018-11-13 Thread John Naylor
On 11/14/18, Tom Lane  wrote:
> Looks like good stuff, pushed with a small additional amount of work.

Thanks.

>> -Does ECPG still need its own functions for mm_alloc() and
>> mm_strdup(), if we have equivalents in src/common?
>
> The error handling isn't the same, so I don't think we can just replace
> them with the src/common functions.  It is, however, potentially worth
> our trouble to fix things so that within pgc.l, palloc() and pstrdup()
> are macros for mm_alloc/mm_strdup, thereby further reducing the diffs
> to the core lexer.
>
> I'd be pretty tempted to also change all the hard references to
> base_yylval to go through a pointer variable, so that diffs like
> this go away:
>
> -yylval->str = pstrdup(yytext);
> +base_yylval.str = mm_strdup(yytext);
>
> I believe that that'd result in more compact code, too --- on most
> machines, references to global variables are more expensive in
> code bytes than indirecting through a local pointer variable.

I'll look into that as time permits.
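
Roughly, I imagine the pgc.l side of that looking something like the
following untested sketch (names here are illustrative only, and the
exact type of base_yylval would need checking against the real
declarations):

#define palloc(sz)    mm_alloc(sz)
#define pstrdup(str)  mm_strdup(str)

static YYSTYPE *yylval_p = &base_yylval;

	/* a rule body can then read almost exactly like the core lexer's: */
	yylval_p->str = pstrdup(yytext);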

On 11/14/18, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> On 2018-Nov-13, Tom Lane wrote:
>>> More generally, I wonder if it'd be practical to make pgc.l backup-free
>>> altogether.  I'm not sure that the resulting gain in lexing speed would
>>> really be of interest to anyone,
>
>> Maybe time to compile the ecpg test cases during "make check-world"?
>
> I'm dubious that the lexer is a significant part of that, though I could
> be wrong ...

If it were, it'd be much easier to try a Flex flag other than the
default (most compact) table representation, something like -Cfe.
That'd be a prerequisite for no-backtracking to matter anyway. That's
easy enough to test, so I'll volunteer to do that sometime.

-John Naylor



Re: DSM segment handle generation in background workers

2018-11-13 Thread Noah Misch
On Wed, Nov 14, 2018 at 05:50:26PM +1300, Thomas Munro wrote:
> On Wed, Nov 14, 2018 at 3:24 PM Noah Misch  wrote:
> > On Mon, Nov 12, 2018 at 09:39:01AM -0500, Tom Lane wrote:
> > > I doubt that's a good idea; to a first approximation, it would mean that
> > > half the seed depends only on the PID and the other half only on the
> > > timestamp.  Maybe we could improve matters a little by left-shifting the
> > > PID four bits or so, but I think we still want it to mix with some
> > > rapidly-changing time bits.
> > >
> > > I'm not really sure that we need to do anything though.  Basically,
> > > what we've got here is a tradeoff between how many bits change over
> > > a given timespan and how unpredictable those bits are.  I don't see
> > > that one of those is necessarily more important than the other.
> >
> > What counts is the ease of predicting a complete seed.  HEAD's algorithm has
> > ~13 trivially-predictable bits, and the algorithm that stood in BackendRun()
> > from 98c5065 until 197e4af had no such bits.  You're right that the other 19
> > bits are harder to predict than any given 19 bits under the old algorithm, 
> > but
> > the complete seed remains more predictable than it was before 197e4af.
> 
> However we mix them, given that the source code is well known, isn't
> an attacker's job really to predict the time and pid, two not
> especially well guarded secrets?

True.  Better to frame the issue as uniform distribution of seed, not
unpredictability of seed selection.

Incidentally, possible future work may be to use pg_strong_random() when
available, like pgbench set_random_seed() does.  That would achieve both
unpredictability and uniform distribution.  It would be mere defense in depth;
if unpredictability matters, one still needs a CSPRNG (e.g. pgcrypto).



Re: Race condition in WaitForBackgroundWorkerStartup

2018-11-13 Thread Amit Kapila
On Tue, Nov 13, 2018 at 11:48 PM Tomas Vondra
 wrote:
>
> On Tue, 2018-11-13 at 07:57 -0600, Jeremy Finzel wrote:
> > ...
> >
> > I'm unclear on what counts as "worker initialization".  The error is
> > happening in the worker_spi_main function, not in the
> > worker_spi_launch function.  So does an immediate error in
> > worker_spi_main count as part of the worker initialization?
> >

Yeah, sort of.  Any error you get there can lead to the results you
are seeing.  If you want a more robust mechanism, you need to write
something like what we have done for parallel workers (see
WaitForParallelWorkersToAttach).  Basically, you need your workers to
reach a specific state; if any error happens before they get there,
the main backend should error out.
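
For illustration only, the shape of that handshake might look like the
sketch below (hypothetical names throughout, and without the locking
and memory-barrier care the real code needs; see
WaitForParallelWorkersToAttach in access/transam/parallel.c for the
real thing):

#include "postgres.h"

#include "pgstat.h"
#include "postmaster/bgworker.h"
#include "storage/condition_variable.h"

/* hypothetical state shared between launcher and worker */
typedef struct WorkerShared
{
	bool		ws_attached;	/* worker sets this once initialized */
	ConditionVariable ws_cv;
} WorkerShared;

/*
 * Launcher side: wait until the worker reaches the agreed state, and
 * error out if it exits before getting there.
 */
static void
wait_for_worker_attach(WorkerShared *shared, BackgroundWorkerHandle *handle)
{
	ConditionVariablePrepareToSleep(&shared->ws_cv);
	for (;;)
	{
		pid_t		pid;

		if (shared->ws_attached)
			break;

		/* if the worker already exited, it can never attach */
		if (GetBackgroundWorkerPid(handle, &pid) == BGWH_STOPPED)
			ereport(ERROR,
					(errmsg("background worker failed during initialization")));

		ConditionVariableSleep(&shared->ws_cv, PG_WAIT_EXTENSION);
	}
	ConditionVariableCancelSleep();
}

/*
 * Worker side, at the end of its initialization:
 *		shared->ws_attached = true;
 *		ConditionVariableBroadcast(&shared->ws_cv);
 */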

>
> I don't think it does (or should). worker_spi_main is pretty much the
> worker body, and worker initialization pretty much means just "we've
> initialized the worker process (including tracking it in shmem etc.)
> and invoked its main function".
>
> I'd also argue the behavior is expected to depend on the machine to
> some extent - depending on speed and load the machine may hit the error
> before/after the GetBackgroundWorkerPid call, producing different
> results.
>
> So I'd say the behavior seems correct, at least from this POV.
>

+1.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: DSM segment handle generation in background workers

2018-11-13 Thread Thomas Munro
On Wed, Nov 14, 2018 at 3:24 PM Noah Misch  wrote:
> On Mon, Nov 12, 2018 at 09:39:01AM -0500, Tom Lane wrote:
> > Thomas Munro  writes:
> > > On Mon, Nov 12, 2018 at 9:34 PM Noah Misch  wrote:
> > >> Compared to the old code, the new code requires more wall time to visit 
> > >> every
> > >> possible seed value.  New code xor's the PID bits into the 
> > >> fastest-changing
> > >> timestamp bits, so only about twenty bits can vary within any given 
> > >> one-second
> > >> period.  (That assumes a PID space of twenty or fewer bits; fifteen bits 
> > >> is
> > >> the Linux default.)  Is that aspect of the change justified?
> >
> > > Hmm, right.  How about applying pg_bswap32() to one of the terms, as
> > > an easy approximation of reversing the bits?
>
> That seems adequate, yes.  But why not just restore the old algorithm in the
> new function?  (I'm not opposed to revising the algorithm, but I think a
> revision should have explicit goals.  The ease of pg_bswap32() is not itself a
> worthy goal, because the old BackendRun() algorithm was also quite easy.)

Ok.  I rewrote it only because in the new coding I had a TimestampTz
rather than the separate sec and usec components of the old code.  I
didn't intend to revise the algorithm fundamentally, but I missed the
significance of the bit shifting from commit 98c50656c that moved the
faster-moving bits up.  I'll change it back.

> > I doubt that's a good idea; to a first approximation, it would mean that
> > half the seed depends only on the PID and the other half only on the
> > timestamp.  Maybe we could improve matters a little by left-shifting the
> > PID four bits or so, but I think we still want it to mix with some
> > rapidly-changing time bits.
> >
> > I'm not really sure that we need to do anything though.  Basically,
> > what we've got here is a tradeoff between how many bits change over
> > a given timespan and how unpredictable those bits are.  I don't see
> > that one of those is necessarily more important than the other.
>
> What counts is the ease of predicting a complete seed.  HEAD's algorithm has
> ~13 trivially-predictable bits, and the algorithm that stood in BackendRun()
> from 98c5065 until 197e4af had no such bits.  You're right that the other 19
> bits are harder to predict than any given 19 bits under the old algorithm, but
> the complete seed remains more predictable than it was before 197e4af.

However we mix them, given that the source code is well known, isn't
an attacker's job really to predict the time and pid, two not
especially well guarded secrets?  I don't see how the mixing matters
in that respect.  (You can also just clobber the seed from SQL, but
that'd be cheating.)

> Goal I recommend: if connections arrive in a Poisson distribution of unknown
> lambda, maximize the number of distinct seeds.

Yeah.  I think bit reversing one of them achieves the maximum number
of distinct seeds by putting the busy ends far apart, but the original
bit shifting is similar.
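
Just to make that concrete, a throwaway illustration of the three
mixing strategies (not the actual postmaster code -- the constants and
the shift amount are only indicative):

#include <stdint.h>
#include <stdio.h>

/* stand-in for port's pg_bswap32() */
static uint32_t
bswap32(uint32_t x)
{
	return (x << 24) | ((x & 0x0000ff00) << 8) |
		   ((x & 0x00ff0000) >> 8) | (x >> 24);
}

int
main(void)
{
	uint32_t	pid = 12345;			/* ~15-bit Linux PID */
	uint32_t	secs = 1542170000;		/* slow-moving seconds */
	uint32_t	usecs = 123456;			/* fast-moving microseconds */
	uint64_t	ts = (uint64_t) secs * 1000000 + usecs;	/* TimestampTz-ish */

	uint32_t	naive = (uint32_t) ts ^ pid;			/* PID lands on the busy low bits */
	uint32_t	shifted = pid ^ (usecs << 12) ^ secs;	/* roughly the old shift-based mix */
	uint32_t	swapped = (uint32_t) ts ^ bswap32(pid);	/* the pg_bswap32() idea */

	printf("%08x %08x %08x\n", naive, shifted, swapped);
	return 0;
}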

-- 
Thomas Munro
http://www.enterprisedb.com



In-place updates and serializable transactions

2018-11-13 Thread Kuntal Ghosh
Hello hackers,

Currently, we're working on the serializable implementation for
zheap.  As mentioned in the README-SSI documentation[1], there is one
difference in PostgreSQL's SSI implementation that can make its
conflict-detection behaviour differ from that of other storage engines
that support in-place updates.

The heap storage in PostgreSQL does not use "update in place" with a
rollback log for its MVCC implementation.  Where possible it uses
"HOT" updates on the same page (if there is room and no indexed value
is changed). For non-HOT updates the old tuple is expired in place and
a new tuple is inserted at a new location.  Because of this
difference, a tuple lock in PostgreSQL doesn't automatically lock any
other versions of a row. We can take the following example from the
doc to understand the situation in more detail:

T1 ---rw---> T2 ---ww---> T3

If transaction T1 reads a row version (thus acquiring a predicate lock
on it) and a second transaction T2 updates that row version (thus
creating a rw-conflict graph edge from T1 to T2), must a third
transaction T3 which re-updates the new version of the row also have a
rw-conflict in from T1 to prevent anomalies?  In other words, does it
matter whether we recognize the edge T1 --rw--> T3?  The document also
includes a nice proof of why we don't try to copy or expand a tuple
lock to any other versions of the row, and why we don't have to
explicitly recognize the edge T1 --rw--> T3.

In PostgreSQL, predicate locking is implemented using the tuple id.
In zheap, since we perform updates in-place, we don't change the tuple
id. So, in the above example, we easily recognize the edge
T1 --rw--> T3. This may increase the number of false positives in
certain cases. In the above example, if we introduce another
transaction T4 such that T3 --rw--> T4 and T4 gets committed first,
then for zheap, T3 will be rolled back because of the dangerous
structure T1 --rw--> T3 --rw--> T4. But for heap, T3 can be committed
(isolation test case [2]). IMHO, this seems to be acceptable behavior.

In brief, due to in-place updates, in some cases, the false positives
may increase for serializable transactions. Any thoughts?

[1] src/backend/storage/lmgr/README-SSI
[2] src/test/isolation/specs/multiple-row-versions.spec
-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-13 Thread Amit Langote
Thanks for updating the patch.

On 2018/11/14 13:16, David Rowley wrote:
> Thanks for looking at this again.
> 
> On 14 November 2018 at 13:47, Amit Langote
>  wrote:
>> +    if (dispatchidx >= proute->dispatch_allocsize)
>> +    {
>> +        /* Expand allocated space. */
>> +        proute->dispatch_allocsize *= 2;
>> +        proute->partition_dispatch_info = (PartitionDispatchData **)
>> +            repalloc(proute->partition_dispatch_info,
>> +                     sizeof(PartitionDispatchData *) *
>> +                     proute->dispatch_allocsize);
>> +    }
>>
>> Sorry, I forgot to point this out before, but can this code in
>> ExecInitPartitionDispatchInfo be accommodated in
>> ExecCheckPartitionArraySpace() for consistency?
> 
> I don't really want to put that code in ExecCheckPartitionArraySpace()
> as the way the function is now, it makes quite a lot of sense for the
> compiler to inline it. If we add redundant work in there, then it
> makes less sense.  There's never any need to check both arrays at once
> as we're only adding the new item to one array at a time.
> 
> Instead, I've written a new function named
> ExecCheckDispatchArraySpace() and put the resize code inside that.

Okay, seems fine.

> I've fixed the typos you mentioned. The only other thing I changed was
> to only allocate the PartitionDispatch->tupslot if a conversion is
> required.  The previous code allocated this regardless if it was going
> to be used or not.  This saves both the redundant allocation and also
> very slightly reduces the cost of the if test in ExecFindPartition().
> There's now no need to check if the map != NULL, as if the slot is there
> then the map must exist too.

Also makes sense.

Although it seems that Alvaro has already started at looking at this, I'll
mark the CF entry as Ready for Committer anyway, because I don't have any
more comments. :)

Thanks,
Amit




Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-13 Thread David Rowley
Thanks for looking at this again.

On 14 November 2018 at 13:47, Amit Langote
 wrote:
> +    if (dispatchidx >= proute->dispatch_allocsize)
> +    {
> +        /* Expand allocated space. */
> +        proute->dispatch_allocsize *= 2;
> +        proute->partition_dispatch_info = (PartitionDispatchData **)
> +            repalloc(proute->partition_dispatch_info,
> +                     sizeof(PartitionDispatchData *) *
> +                     proute->dispatch_allocsize);
> +    }
>
> Sorry, I forgot to point this out before, but can this code in
> ExecInitPartitionDispatchInfo be accommodated in
> ExecCheckPartitionArraySpace() for consistency?

I don't really want to put that code in ExecCheckPartitionArraySpace()
as the way the function is now, it makes quite a lot of sense for the
compiler to inline it. If we add redundant work in there, then it
makes less sense.  There's never any need to check both arrays at once
as we're only adding the new item to one array at a time.

Instead, I've written a new function named
ExecCheckDispatchArraySpace() and put the resize code inside that.
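
For reference, the helper is essentially the hunk quoted above wrapped
up in a function, roughly as below (the attached v17 patch is
authoritative):

static inline void
ExecCheckDispatchArraySpace(PartitionTupleRouting *proute, int dispatchidx)
{
	if (dispatchidx >= proute->dispatch_allocsize)
	{
		/* Expand allocated space. */
		proute->dispatch_allocsize *= 2;
		proute->partition_dispatch_info = (PartitionDispatchData **)
			repalloc(proute->partition_dispatch_info,
					 sizeof(PartitionDispatchData *) *
					 proute->dispatch_allocsize);
	}
}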

I've fixed the typos you mentioned. The only other thing I changed was
to only allocate the PartitionDispatch->tupslot if a conversion is
required.  The previous code allocated this regardless if it was going
to be used or not.  This saves both the redundant allocation and also
very slightly reduces the cost of the if test in ExecFindPartition().
There's now no need to check if the map != NULL, as if the slot is there
then the map must exist too.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v17-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
Description: Binary data


v17-0002-Delay-locking-of-partitions-during-INSERT-and-UP.patch
Description: Binary data


Re: [HACKERS] WAL logging problem in 9.4.3?

2018-11-13 Thread Kyotaro HORIGUCHI
Hello.

At Thu, 11 Oct 2018 17:04:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20181011.170453.123148806.horiguchi.kyot...@lab.ntt.co.jp>
> At Thu, 11 Oct 2018 13:42:35 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20181011.134235.218062184.horiguchi.kyot...@lab.ntt.co.jp>
> I refactored getPendingSyncEntry out of RecordPendingSync,
> BufferNeedsWAL and RelationTruncate. And split the second patch
> into infrastructure-side and user-side ones. I expect it makes
> reviewing far easier.
> 
> I replaced RelationNeedsWAL in a part of the code added in
> heap_update() by bfa2ab56bb.
> 
> - v3-0001-TAP-test-for-copy-truncation-optimization.patch
> 
>  TAP test
> 
> -v3-0002-Write-WAL-for-empty-nbtree-index-build.patch
> 
>  nbtree "fix"
> 
> - v3-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch
> 
>  Pending-sync infrastructure.
> 
> - v3-0004-Fix-WAL-skipping-feature.patch
> 
>  Actual fix of WAL skipping feature.

0004 was broken by e9edc1ba0b, so I rebased to the current HEAD.
It successfully built and passed all regression/recovery tests,
including the additional recovery/t/016_wal_optimize.pl.
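
For context, the intended usage pattern for a caller (a sketch based on
the comment added in patch 0004 below; assuming heap_register_sync()
takes the Relation, as in that patch) is roughly:

	/*
	 * e.g. a bulk-loading caller such as COPY, for a table created in
	 * the current transaction, when WAL isn't needed for archival/PITR
	 */
	if (!XLogIsNeeded())
		heap_register_sync(rel);	/* later changes may skip WAL; the heap
									 * is synced to disk at COMMIT */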

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 666d27dbc47c9963e5098904ffb9b173effaf853 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 11 Oct 2018 16:18:04 +0900
Subject: [PATCH 4/4] Fix WAL skipping feature.

This patch replaces the WAL-skipping mechanism, moving from
HEAP_INSERT_SKIP_WAL to the pending-sync tracking infrastructure.
---
 src/backend/access/heap/heapam.c| 71 ++---
 src/backend/access/heap/pruneheap.c |  3 +-
 src/backend/access/heap/rewriteheap.c   |  3 --
 src/backend/access/heap/visibilitymap.c |  3 +-
 src/backend/commands/copy.c | 13 +++---
 src/backend/commands/createas.c |  9 ++---
 src/backend/commands/matview.c  |  6 +--
 src/backend/commands/tablecmds.c|  5 +--
 src/backend/commands/vacuumlazy.c   |  6 +--
 src/include/access/heapam.h |  9 ++---
 10 files changed, 73 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7caa3ec248..a68eae9b11 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *	  the POSTGRES heap access method used for all POSTGRES
  *	  relations.
  *
+ * WAL CONSIDERATIONS
+ *	  All heap operations are normally WAL-logged. but there are a few
+ *	  exceptions. Temporary and unlogged relations never need to be
+ *	  WAL-logged, but we can also skip WAL-logging for a table that was
+ *	  created in the same transaction, if we don't need WAL for PITR or
+ *	  WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *	  the file to disk at COMMIT instead.
+ *
+ *	  The same-relation optimization is not employed automatically on all
+ *	  updates to a table that was created in the same transaction, because
+ *	  for a small number of changes, it's cheaper to just create the WAL
+ *	  records than fsyncing() the whole relation at COMMIT. It is only
+ *	  worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *	  or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *	  operation; it will cause any subsequent updates to the table to skip
+ *	  WAL-logging, if possible, and cause the heap to be synced to disk at
+ *	  COMMIT.
+ *
+ *	  To make that work, all modifications to heap must use
+ *	  BufferNeedsWAL() to check if WAL-logging is needed in this transaction
+ *	  for the given block.
+ *
  *-
  */
 #include "postgres.h"
@@ -2414,12 +2436,6 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2455,6 +2471,7 @@ ReleaseBulkInsertStatePin(BulkInsertState bistate)
  * TID where the tuple was stored.  But note that any toasting of fields
  * within the tuple data is NOT reflected into *tup.
  */
+
 Oid
 heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 			int options, BulkInsertState bistate)
@@ -2528,7 +2545,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+	if (BufferNeedsWAL(relation, buffer))
 	{
 		xl_heap_insert xlrec;
 		xl_heap_header xlhdr;
@@ -2730,7 +2747,6 @@ heap_multi_insert(Relation relation, 

Ltree Extension Enhancements

2018-11-13 Thread Denny Herssein

Hi All,

First, kudos to the creators of  Ltree, a very useful tool.

My labels have dots, blanks, apostrophes, commas and other special 
characters.


I was wondering if there is a way to modify ltree to allow labels with
special characters - including blanks and dots!


Also, I would like to use "|" as the separator (instead of ".").
Ideally, Ltree would allow a user-defined character to serve as the
separator.


Daniel




Re: DSM segment handle generation in background workers

2018-11-13 Thread Noah Misch
On Mon, Nov 12, 2018 at 09:39:01AM -0500, Tom Lane wrote:
> Thomas Munro  writes:
> > On Mon, Nov 12, 2018 at 9:34 PM Noah Misch  wrote:
> >> Compared to the old code, the new code requires more wall time to visit 
> >> every
> >> possible seed value.  New code xor's the PID bits into the fastest-changing
> >> timestamp bits, so only about twenty bits can vary within any given 
> >> one-second
> >> period.  (That assumes a PID space of twenty or fewer bits; fifteen bits is
> >> the Linux default.)  Is that aspect of the change justified?
> 
> > Hmm, right.  How about applying pg_bswap32() to one of the terms, as
> > an easy approximation of reversing the bits?

That seems adequate, yes.  But why not just restore the old algorithm in the
new function?  (I'm not opposed to revising the algorithm, but I think a
revision should have explicit goals.  The ease of pg_bswap32() is not itself a
worthy goal, because the old BackendRun() algorithm was also quite easy.)

> I doubt that's a good idea; to a first approximation, it would mean that
> half the seed depends only on the PID and the other half only on the
> timestamp.  Maybe we could improve matters a little by left-shifting the
> PID four bits or so, but I think we still want it to mix with some
> rapidly-changing time bits.
> 
> I'm not really sure that we need to do anything though.  Basically,
> what we've got here is a tradeoff between how many bits change over
> a given timespan and how unpredictable those bits are.  I don't see
> that one of those is necessarily more important than the other.

What counts is the ease of predicting a complete seed.  HEAD's algorithm has
~13 trivially-predictable bits, and the algorithm that stood in BackendRun()
from 98c5065 until 197e4af had no such bits.  You're right that the other 19
bits are harder to predict than any given 19 bits under the old algorithm, but
the complete seed remains more predictable than it was before 197e4af.

Goal I recommend: if connections arrive in a Poisson distribution of unknown
lambda, maximize the number of distinct seeds.

nm



Re: wal_dump output on CREATE DATABASE

2018-11-13 Thread Michael Paquier
On Tue, Nov 13, 2018 at 09:39:55PM +0100, Tomas Vondra wrote:
> So -1 to this from me. I'd just swap the fields in the copy dir message
> and call it a day.

+1.  Using the opposite order for what is actually a database path is
kind of confusing.
--
Michael




Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-13 Thread Haribabu Kommi
On Wed, Nov 14, 2018 at 12:26 AM Amit Kapila 
wrote:

> On Tue, Nov 13, 2018 at 11:32 AM Haribabu Kommi
>  wrote:
> >
> > On Mon, Nov 12, 2018 at 6:34 PM Amit Kapila 
> wrote:
> >> > I can revert it back to void,
> >> >
> >>
> >> +1, as we don't see any good reason to break backward compatibility.
> >
> >
> > Thanks for the review.
> > Attached the updated patch with return type as void.
> >
>
> With this patch, we are intending to remove the entries matching
> userid, dbid, queryid from hash table (pgss_hash), but the contents of
> the file (
> pgss_query_texts.stat) will remain unchanged unless all of the entries
> are removed from hash table.  This appears fine to me, especially
> because there is no easy way to remove the contents from the file.
> Does anybody see any problem with this behavior?
>

Adding more info to the above point: even if the file contents are not
removed now, the file will later be truncated once its size and the
number of pg_stat_statements entries satisfy the garbage-collection
criteria.  So I feel that not removing the contents when the query
stats are reset using specific parameters is fine.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: [Bug Fix]ECPG: cancellation of significant digits on ECPG

2018-11-13 Thread Tom Lane
"Higuchi, Daisuke"  writes:
> From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
>> So I think that we ought to unconditionally make the sqlda value's digit 
>> buffer look just like the one we're copying, even when ndigits = 0, 
>> which just requires removing the tests on ndigits.

> I agree with you. Seeing this thread[1], the 'if (ndigits = 0)' check was
> introduced only to avoid a memcpy() crash. I do not know whether this
> solution was best or not, but no crash occurs in the current version. So I
> also think the 'if (ndigits = 0)' check should be removed.

Hmmm ... looking at PGTYPESnumeric_from_asc, it seems like the current
behavior is different from what was described in that old thread; the only
case where a digit buffer wouldn't be created is a NaN.  But maybe a crash
could occur for NaN.  Perhaps we should use "if (num->sign !=
NUMERIC_NAN)" as a guard?

regards, tom lane



Re: move PartitionBoundInfo creation code

2018-11-13 Thread Amit Langote
Hi,

On 2018/11/13 22:59, Alvaro Herrera wrote:
> On 2018-Nov-13, Michael Paquier wrote:
> 
>> On Tue, Nov 13, 2018 at 09:58:08AM -0300, Alvaro Herrera wrote:
>>> "the context that was active when the function was called" is typically
>>> expressed simply as "the current memory context".  Perhaps the whole
>>> phrase can be reduced to "The object returned by this function is wholly
>>> allocated in the current memory context"?

Yeah, I don't know why I had to put it in such a convoluted manner.

>> Yes, that looks cleaner.  I am planning to do a last lookup round on
>> tomorrow morning my time before committing, so I may still tweak a
>> couple of other words...
> 
> Cool.

Thanks Michael.

> I gave the patch a read and it looks reasonable to me.
> 
> Memory management in RelationBuildPartitionDesc is crummy, but I don't
> think it's this patch's fault.

Are you perhaps referring to the discussion about partitioning cache
management we had a few months back, in which we decided to postpone
any development work to PG 12?  The following thread, maybe:

https://www.postgresql.org/message-id/143ed9a4-6038-76d4-9a55-502035815e68%40lab.ntt.co.jp

Thanks,
Amit




Re: move PartitionBoundInfo creation code

2018-11-13 Thread Michael Paquier
On Tue, Nov 13, 2018 at 10:59:15AM -0300, Alvaro Herrera wrote:
> I gave the patch a read and it looks reasonable to me.
> 
> Memory management in RelationBuildPartitionDesc is crummy, but I don't
> think it's this patch's fault.

I agree, and that shows up pretty clearly after the refactoring is done.
The mess is partially caused by the handling around the case where there
is no partition data to attach to PartitionDescData.  I would personally
much prefer if we could also avoid using partition_bounds_copy.

It does not prevent the first refactoring step, so I have committed the
patch as it is already doing a lot.
--
Michael




RE: [Bug Fix]ECPG: cancellation of significant digits on ECPG

2018-11-13 Thread Higuchi, Daisuke
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
>I took a quick look at this. 
Thank you for review. 

>So I think that we ought to unconditionally make the sqlda value's digit 
>buffer look just like the one we're copying, even when ndigits = 0, 
>which just requires removing the tests on ndigits.

I agree with you. Seeing this thread[1], the 'if (ndigits = 0)' check was
introduced only to avoid a memcpy() crash. I do not know whether this solution
was best or not, but no crash occurs in the current version. So I also think
the 'if (ndigits = 0)' check should be removed.

>In short, the attached.
Thank you for updating. This patch looks good to me. 

[1]
https://www.postgresql.org/message-id/4EC825F3.5080504%40cybertec.at

Regards,
Daisuke Higuchi





Re: Copy data to DSA area

2018-11-13 Thread Thomas Munro
On Tue, Nov 13, 2018 at 10:59 PM Ideriha, Takeshi
 wrote:
> Can I check my understanding?
> The situation you are talking about is the following:
> Data structures A and B will be allocated on DSA. A has a pointer to B.
> Another data structure X is already allocated in some other area (might be
> on the local heap) and has a pointer variable X->a, which will be linked
> to A.
>
>  old_context = MemoryContextSwitchTo(dsa_memory_context);
>  A = palloc(sizeof(*A));
>  B = palloc(sizeof(*B));
>  A->b = B;
>  X->a = A;
>  MemoryContextSwitchTo(old_context);
>
> If an error happens partway through this flow, the palloc'd data (A & B)
> should be freed, and the pointer to the freed data (X->a) should be restored
> to its original value.  So handling these cases introduces a "transaction"
> API like begin_allocate() and end_allocate().

I wasn't thinking of resetting X->a to its original value, just
freeing A and B.  For a shared cache in this memory area, you need to
make sure that you never leak memory; it must either be referenced by
the cache bookkeeping data structures so that you can find it and
manually free it eventually (probably with an LRU system?), or it must
somehow be cleaned up if an error occurs before you manage to insert
it into the cache.  During construction of an object graph, before you
attach it to your manual bookkeeping data structures, there is a
danger zone.

For example, what if, while copying a query plan or a catcache entry,
we successfully allocate a new cache entry with palloc(), but then
another palloc() call fails because we run out of memory when copying
the keys?  It looks like the current catcache.c coding just doesn't
care: memory will be leaked permanently in CacheMemoryContext.  That's
OK because it is process-local.  Such a process is probably doomed
anyway.  If you exit and then start a new backend the problem is gone.
Here we need something better, because once this new kind of memory
fills up with leaked objects, you have to restart the whole cluster to
recover.

If the solution involves significantly different control flows (like
PG_TRY(), PG_FINALLY() manual cleanup all over the place), then lots
of existing palloc-based code will need to be reviewed and rewritten
very carefully for use in this new kind of memory context.  If it can
be done automagically, then more of our existing code will be able to
participate in the construction of stuff that we want to put in shared
memory.
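
For concreteness, the kind of manual cleanup I mean looks something
like this (purely a sketch: the shared context "dsa_mcxt" and the
bookkeeping hook "cache_insert" are made-up names):

#include "postgres.h"

extern MemoryContext dsa_mcxt;			/* hypothetical DSA-backed context */
extern void cache_insert(void *entry);	/* hypothetical cache bookkeeping */

void
build_shared_entry(const char *key)
{
	MemoryContext oldcxt = MemoryContextSwitchTo(dsa_mcxt);
	char	   *volatile entry = NULL;

	PG_TRY();
	{
		entry = pstrdup(key);	/* may throw if the area is exhausted */
		/* ... build the rest of the object graph here ... */
		cache_insert(entry);	/* only now is it reachable by others */
	}
	PG_CATCH();
	{
		/* not yet reachable from the cache: free it or it leaks forever */
		if (entry != NULL)
			pfree(entry);
		MemoryContextSwitchTo(oldcxt);
		PG_RE_THROW();
	}
	PG_END_TRY();

	MemoryContextSwitchTo(oldcxt);
}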

I first thought about this in the context of shared plan caches, a
topic that appears on this mailing list from time to time.  There are
some other fun problems, like cache invalidation for shared caches,
space recycling (LRUs etc), and of course locking/concurrency
(dsa_allocate() and dsa_free() are concurrency safe, but whatever data
structures you build with the memory of course need their own plan for
dealing with concurrency).  But a strategy for clean-up on errors
seems to be a major subproblem to deal with early in the project.  I
will be very happy if we can come up with something :-)

On Wed, Nov 14, 2018 at 12:40 AM Ideriha, Takeshi
 wrote:
> From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
> >postgres=# call hoge_list(3);
> >NOTICE:  Contents of list:
> >NOTICE:   1
> >NOTICE:   2
> >NOTICE:   3
> >CALL
> >
> >You can call that procedure from various different backends to add and remove
> >integers from a List.  The point is that it could just as easily be a query 
> >plan or any
> >other palloc-based stuff in our tree, and the pointers work in any backend.
>
> Thanks for the patch. I know it's a rough sketch but basically that's what I 
> want to do.
> I tried it and it works well.

After your message I couldn't resist a small experiment.  But it seems
there are plenty more problems to be solved to make it useful...

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-13 Thread Amit Langote
On 2018/11/14 0:32, Jesper Pedersen wrote:
> Hi,
> 
> On 11/12/18 6:17 PM, David Rowley wrote:
>> On 9 November 2018 at 19:18, Amit Langote
>>  wrote:
>>> I have a comment regarding how you chose to make
>>> PartitionTupleRouting private.
>>>
>>> Using the v14_to_v15 diff, I could quickly see that there are many diffs
>>> changing PartitionTupleRouting to struct PartitionTupleRouting, but they
>>> would be unnecessary if you had added the following in execPartition.h, as
>>> my upthread had done.
>>>
>>> -/* See execPartition.c for the definition. */
>>> +/* See execPartition.c for the definitions. */
>>>   typedef struct PartitionDispatchData *PartitionDispatch;
>>> +typedef struct PartitionTupleRouting PartitionTupleRouting;
>>
>> Okay, done that way. v16 attached.

Thank you.

> Still passes check-world.

I looked at v16 and noticed a few typos:

+  * partition_dispatch_info Array of 'dispatch_allocsize' elements containing
+  * a pointer to a PartitionDispatch objects for

a PartitionDispatch objects -> a PartitionDispatch object

+  * partitions  Array of 'partitions_allocsize' elements
+  * containing pointers to a ResultRelInfos of all
+  * leaf partitions touched by tuple routing.

a ResultRelInfos -> ResultRelInfos

+ * PartitionDispatch and ResultRelInfo pointers the 'partitions' array.

"in" the 'partitions' array.

+/* Setup the PartitionRoutingInfo for it */

Setup -> Set up

+ *  Ensure there's enough space in the 'partitions' array of 'proute'

+ *  and store it in the next empty slot in proute's partitions array.


Not a typo, but maybe just write proute->partitions instead of "partitions
array of proute" and "proute's partition array".

+ *  the next available slot in the 'proute' partition_dispatch_info[]
+ *  array.  Also, record the index into this array in the 'parent_pd'

Similarly, here: proute->partition_dipatch_info array

+ *  array.  Also, record the index into this array in the 'parent_pd'
+ *  indexes[] array in the partidx element so that we can properly

Similarly, parent_pd->indexes array

+    if (dispatchidx >= proute->dispatch_allocsize)
+    {
+        /* Expand allocated space. */
+        proute->dispatch_allocsize *= 2;
+        proute->partition_dispatch_info = (PartitionDispatchData **)
+            repalloc(proute->partition_dispatch_info,
+                     sizeof(PartitionDispatchData *) *
+                     proute->dispatch_allocsize);
+    }

Sorry, I forgot to point this out before, but can this code in
ExecInitPartitionDispatchInfo be accommodated in
ExecCheckPartitionArraySpace() for consistency?

Thanks,
Amit




Re: Convert MAX_SAOP_ARRAY_SIZE to new guc

2018-11-13 Thread James Coleman
Note: the original email from David went to my spam folder, and it also
didn't show up on the archives (I assume caught by a spam filter there
also?)

Thanks for taking this on!
>
> As far as you can tell, is the default correct at 100?
>

I'm not sure what a good way of measuring it would be (that is, what all
the possible cases are). I did try very simple SELECT * FROM t WHERE i IN
(...) style queries with increasing size and was able to see increased
planning time, but nothing staggering (going from 1000 to 2000 increased
from ~1.5ms to 2.5ms planning time, in an admittedly very unscientific
test.)

I think it's reasonable to leave the default at 100 for now. You could make
an argument for increasing it since the limit currently affects whether
scalar array ops can use partial indexes with "foo is not null" conditions,
but I think that's better solved more holistically, as I've attempted to do
in
https://www.postgresql.org/message-id/CAAaqYe8yKSvzbyu8w-dThRs9aTFMwrFxn_BkTYeXgjqe3CbNjg%40mail.gmail.com


> What are some issues that might arise if it's set too low/too high?
>

Too low would result in queries being planned unsatisfactorily (i.e.,
scalar array ops switching from partial index scans to seq scans), and
setting it too high could significantly increase planning time.


Re: Refactoring the checkpointer's fsync request queue

2018-11-13 Thread Thomas Munro
(Replies to a couple of different messages below)

On Wed, Nov 14, 2018 at 6:04 AM Robert Haas  wrote:
> On Sun, Nov 11, 2018 at 9:59 PM Thomas Munro
>  wrote:
> > There is one major problem with this patch
>
> If there's only one, you're doing great!  Although admittedly this
> seems like a big one...

Make that two.

> > 2.  Offload the BufferSync() work to bgwriter, so the checkpointer can
> > keep draining the pipe.  Communication between checkpointer and
> > bgwriter can be fairly easily multiplexed with the pipe draining work.
>
> That sounds a little like you are proposing to go back to the way
> things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
> belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
> the division of labor wouldn't be quite the same.

But is there an argument against it?  The checkpointer would still be
creating checkpoints including running fsync, but the background
writer would be, erm, writing, erm, in the background.

Admittedly it adds a whole extra rabbit hole to this rabbit hole,
which was itself a diversion from my goal of refactoring the syncing
machinery to support undo logs.  But the other two ideas seem to suck
and/or have correctness issues.

On Wed, Nov 14, 2018 at 7:43 AM Robert Haas  wrote:
> On Tue, Nov 13, 2018 at 1:07 PM Andres Freund  wrote:
> > On 2018-11-13 12:04:23 -0500, Robert Haas wrote:
> > > I still feel like this whole pass-the-fds-to-the-checkpointer thing is
> > > a bit of a fool's errand, though.  I mean, there's no guarantee that
> > > the first FD that gets passed to the checkpointer is the first one
> > > opened, or even the first one written, is there?
> > I'm not sure I understand the danger you're seeing here. It doesn't have
> > to be the first fd opened, it has to be an fd that's older than all the
> > writes that we need to ensure made it to disk. And that ought to be
> > guaranteed by the logic?  Between the FileWrite() and the
> > register_dirty_segment() (and other relevant paths) the FD cannot be
> > closed.
>
> Suppose backend A and backend B open a segment around the same time.
> Is it possible that backend A does a write before backend B, but
> backend B's copy of the fd reaches the checkpointer before backend A's
> copy?  If you send the FD to the checkpointer before writing anything
> then I think it's fine, but if you write first and then send the FD to
> the checkpointer I don't see what guarantees the ordering.

I'm not sure if it matters whether we send the fd before or after the
write, but we still need some kind of global ordering of fds that can
order a given fd with respect to writes in other processes, so the
patch introduces a global shared counter captured immediately after
open() (including when reopened in the vfd machinery).

In your example, both fds arrive in the checkpointer in some order,
and it will keep the one with the older sequence number and close the
other one.  This sorting of all interesting fds will be forced before
the checkpoint completes by AbsorbFsyncRequests(), which drains all
messages from the pipe until it sees a message for the next checkpoint
cycle.
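
In sketch form (hypothetical types and names, not the patch itself),
the checkpointer's keep-the-oldest rule is just:

#include "postgres.h"

#include <unistd.h>

typedef struct FsyncFd
{
	int			fd;			/* fd received from some backend, or -1 */
	uint64		open_seq;	/* global counter captured just after open() */
} FsyncFd;

/* Keep whichever descriptor was opened earlier; close the other one. */
static void
remember_fd(FsyncFd *kept, FsyncFd *incoming)
{
	if (kept->fd == -1 || incoming->open_seq < kept->open_seq)
	{
		if (kept->fd != -1)
			close(kept->fd);
		*kept = *incoming;
	}
	else
		close(incoming->fd);
}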

Hmm, I think there is a flaw in the plan here though.  Messages for
different checkpoint cycles race to enter the pipe around the time the
cycle counter is bumped, so you could have a message for n hiding
behind a message for n + 1 and not drain enough; I'm not sure and need
to look at something else today, but I see a couple of potential
solutions to that which I will mull over, based on either a new shared
counter increment or a special pipe message written after BufferSync()
by the bgwriter (if we go for idea #2; Andres had something similar in
the original prototype but it could self-deadlock).  I need to figure
out if that is a suitable barrier due to buffer interlocking.

> > > It seems like if you wanted to make this work reliably, you'd need to
> > > do it the other way around: have the checkpointer (or some other
> > > background process) open all the FDs, and anybody else who wants to
> > > have one open get it from the checkpointer.
> >
> > That'd require a process context switch for each FD opened, which seems
> > clearly like a no-go?
>
> I don't know how bad that would be.  But hey, no cost is too great to
> pay as a workaround for insane kernel semantics, right?

Yeah, seems extremely expensive and unnecessary.  It seems sufficient
to track the global opening order... or at least a proxy that
identifies the fd that performed the oldest write.  Which I believe
this patch is doing.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: date_trunc() in a specific time zone

2018-11-13 Thread Tom Lane
I wrote:
> BTW, I'd been hoping that we could avoid rotate-to-local-and-back
> in Vik's desired case, but after further thought I suspect the only
> real optimization that's possible compared to writing it out with
> two AT TIME ZONE constructs is to do the zone name lookup just once.
> As an example, truncating to a day-or-larger boundary could result in
> shifting to a different UTC offset than you started with, due to crossing
> a DST boundary.

Here's a v2 that transposes the code to C so that we can get that
optimization.  I've not tried to time it, but it should actually be
a bit faster than standard date_trunc plus one AT TIME ZONE rotation,
never mind two of them.

regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1678c8c..adffa7d 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*** SELECT regexp_match('abc01234xyz', '(?:(
*** 7186,7191 
--- 7186,7200 
 
  
 
+ date_trunc(text, timestamp with time zone, text)
+ timestamp with time zone
+ Truncate to specified precision in the specified time zone; see also 
+ 
+ date_trunc('day', timestamptz '2001-02-16 20:38:40+00', 'Australia/Sydney')
+ 2001-02-16 13:00:00+00
+
+ 
+
  date_trunc(text, interval)
  interval
  Truncate to specified precision; see also 
*** SELECT date_part('hour', INTERVAL '4 hou
*** 8078,8084 
  
 
  
! date_trunc('field', source)
  
  source is a value expression of type
  timestamp or interval.
--- 8087,8093 
  
 
  
! date_trunc(field, source [, time_zone])
  
  source is a value expression of type
  timestamp or interval.
*** date_trunc('field
  
 
! Examples:
  
  SELECT date_trunc('hour', TIMESTAMP '2001-02-16 20:38:40');
  Result: 2001-02-16 20:00:00
  
  SELECT date_trunc('year', TIMESTAMP '2001-02-16 20:38:40');
  Result: 2001-01-01 00:00:00
  
 

--- 8121,8158 
 
  
 
! If the optional time_zone argument is
! present, the source value is truncated in the
! specified time zone; for example, truncation to day
! produces a value that is midnight in that zone.  The time zone name can
! be specified in any of the ways described in
! .
!
! 
!
! When the time_zone argument is
! present, the source and result are always of
! type timestamp with time zone, whereas the two-argument
! form of date_trunc is available for timestamps with
! or without time zone.  The two-argument form truncates timestamp
! with time zone values using the current
!  setting.
!
! 
!
! Examples (assuming the local time zone is America/New_York):
  
  SELECT date_trunc('hour', TIMESTAMP '2001-02-16 20:38:40');
  Result: 2001-02-16 20:00:00
  
  SELECT date_trunc('year', TIMESTAMP '2001-02-16 20:38:40');
  Result: 2001-01-01 00:00:00
+ 
+ SELECT date_trunc('day', TIMESTAMP WITH TIME ZONE '2001-02-16 20:38:40+00');
+ Result: 2001-02-16 00:00:00-05
+ 
+ SELECT date_trunc('day', TIMESTAMP WITH TIME ZONE '2001-02-16 20:38:40+00', 'Australia/Sydney');
+ Result: 2001-02-16 08:00:00-05
  
 

diff --git a/src/backend/utils/adt/timestamp.c b/src/backend/utils/adt/timestamp.c
index 449164a..e5f8b51 100644
*** a/src/backend/utils/adt/timestamp.c
--- b/src/backend/utils/adt/timestamp.c
*** timestamp_trunc(PG_FUNCTION_ARGS)
*** 3925,3938 
  	PG_RETURN_TIMESTAMP(result);
  }
  
! /* timestamptz_trunc()
!  * Truncate timestamp to specified units.
   */
! Datum
! timestamptz_trunc(PG_FUNCTION_ARGS)
  {
- 	text	   *units = PG_GETARG_TEXT_PP(0);
- 	TimestampTz timestamp = PG_GETARG_TIMESTAMPTZ(1);
  	TimestampTz result;
  	int			tz;
  	int			type,
--- 3925,3939 
  	PG_RETURN_TIMESTAMP(result);
  }
  
! /*
!  * Common code for timestamptz_trunc() and timestamptz_trunc_zone().
!  *
!  * tzp identifies the zone to truncate with respect to.  We assume
!  * infinite timestamps have already been rejected.
   */
! static TimestampTz
! timestamptz_trunc_internal(text *units, TimestampTz timestamp, pg_tz *tzp)
  {
  	TimestampTz result;
  	int			tz;
  	int			type,
*** timestamptz_trunc(PG_FUNCTION_ARGS)
*** 3943,3951 
  	struct pg_tm tt,
			   *tm = &tt;
  
- 	if (TIMESTAMP_NOT_FINITE(timestamp))
- 		PG_RETURN_TIMESTAMPTZ(timestamp);
- 
  	lowunits = downcase_truncate_identifier(VARDATA_ANY(units),
  			VARSIZE_ANY_EXHDR(units),
  			false);
--- 3944,3949 
*** timestamptz_trunc(PG_FUNCTION_ARGS)
*** 3954,3960 
  
  	if (type == UNITS)
  	{
! 		if (timestamp2tm(timestamp, &tz, tm, &fsec, NULL, NULL) != 0)
  			ereport(ERROR,
  	(errcode(ERRCODE_DATETIME_VALUE_OUT_OF_RANGE),
  	 errmsg("timestamp out of range")));
--- 3952,3958 
  
  	if (type == UNITS)
  	{
! 		if (timestamp2tm(timestamp, &tz, tm, &fsec, NULL, tzp) != 0)
  			ereport(ERROR,
  	

Re: add the source of param misconfigurations to error messages

2018-11-13 Thread Tom Lane
Andres Freund  writes:
> On 2018-11-13 17:33:01 -0500, Tom Lane wrote:
>> Seems to me it'd result in an impossibly unwieldy message, especially
>> once you realize you might have to deal with other value sources than
>> files.  Adhering to the translatability guidelines (ie, "don't construct
>> messages out of parts") would be problematic for that too.

> Note that I'm convinced this is a necessary feature.  But if we were to
> do it, I'd assume we put something like this in the DETAIL, not the
> ERROR itself. That ought to alleviate some of those concerns?

It would still be pretty unwieldy.

>> We already have fairly substantial support for diagnosing such mistakes
>> through the pg_file_settings view, though admittedly if you don't use
>> that *before* restarting the server, it does not help with this.

> Would be kind of useful if --describe-config, or a version thereof,
> would print that kind of information too.

Hmm, that's an idea.  We could probably refactor the pg_file_settings
infrastructure so it could be invoked by a command line option (maybe call
it --debug-config?) and just dump the results to stdout.  That would have
the advantage that it could report on more than one error, if there is
one.

regards, tom lane



Re: add the source of param misconfigurations to error messages

2018-11-13 Thread Andres Freund
Hi,

On 2018-11-13 17:33:01 -0500, Tom Lane wrote:
> Jordan Deitch  writes:
> > $ postgres --config-file="/etc/postgresql/10/main/postgresql.conf"
> > can fail with the following error:
> > postgres: superuser_reserved_connections must be less than max_connections
> 
> > This is due to the addition of the postgresql.auto.conf params.
> > Would the community welcome a patch whereby this detail is contained in the
> > error?
> 
> Seems to me it'd result in an impossibly unwieldy message, especially
> once you realize you might have to deal with other value sources than
> files.  Adhering to the translatability guidelines (ie, "don't construct
> messages out of parts") would be problematic for that too.

Note that I'm convinced this is a necessary feature.  But if we were to
do it, I'd assume we put something like this in the DETAIL, not the
ERROR itself. That ought to alleviate some of those concerns?


> We already have fairly substantial support for diagnosing such mistakes
> through the pg_file_settings view, though admittedly if you don't use
> that *before* restarting the server, it does not help with this.

Would be kind of useful if --describe-config, or a version thereof,
would print that kind of information too.

Greetings,

Andres Freund



Re: [PATCH] Pass COPT and PROFILE to CXXFLAGS as well

2018-11-13 Thread Andres Freund
Hi,

On 2018-11-13 17:27:17 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > On 2018-11-13 11:40:05 +0100, Christoph Berg wrote:
> >> While working on making extension modules built reproducibly, I
> >> noticed that extra flags passed via COPT (notably -ffile-prefix-map)
> >> do not get added to CXXFLAGS.
> 
> > PROFILE I can see, but COPT I'm less sure. The name suggests it's about
> > C not C++. How about adding CXXOPT?
> 
> > Secondary question:  Why aren't you using CFLAGS/CXXFLAGS for this?
> 
> COPT (and PROFILE) are handy for injecting additional flags manually;
> they're even documented for that (cf installation.sgml, around line
> 1550).

> I agree that CXXOPT would be a better idea than COPT for this.

Yea, I agree that we want a CXX equivalent of the feature, especially
for -Werror.  I'm just not sure that's the best fit for the Debian
build issue Christoph ran into. I guess the goal is to *not* include
the -ffile-prefix-map in the CFLAGS for extensions, unless those are
also built via the Debian machinery?


> Not sure about PROFILE.  But we could inject the latter into both
> flag sets and then document that if that's not what you want, use
> COPT.

That seems reasonable to me too.

Greetings,

Andres Freund



Re: [PATCH] Pass COPT and PROFILE to CXXFLAGS as well

2018-11-13 Thread Tom Lane
Andres Freund  writes:
> On 2018-11-13 11:40:05 +0100, Christoph Berg wrote:
>> While working on making extension modules built reproducibly, I
>> noticed that extra flags passed via COPT (notably -ffile-prefix-map)
>> do not get added to CXXFLAGS.

> PROFILE I can see, but COPT I'm less sure. The name suggests it's about
> C not C++. How about adding CXXOPT?

> Secondary question:  Why aren't you using CFLAGS/CXXFLAGS for this?

COPT (and PROFILE) are handy for injecting additional flags manually;
they're even documented for that (cf installation.sgml, around line
1550).

I agree that CXXOPT would be a better idea than COPT for this.
Not sure about PROFILE.  But we could inject the latter into both
flag sets and then document that if that's not what you want, use COPT.

regards, tom lane



add the source of param misconfigurations to error messages

2018-11-13 Thread Jordan Deitch
Hi Hackers -

The command 

$ postgres --config-file="/etc/postgresql/10/main/postgresql.conf"

can fail with the following error:

postgres: superuser_reserved_connections must be less than max_connections

even if max_connections > superuser_reserved_connections in that config file

This is due to the addition of the postgresql.auto.conf params.
Would the community welcome a patch whereby this detail is contained in the
error?

something to the effect of:

postgres: superuser_reserved_connections must be less than max_connections 
(superuser_reserved_connections is set to 10 in $path$ and max_connections is 
set to 10 in $path$)

Alternatively, we could track ALTER SYSTEM changes in postgresql.conf itself, so 
we have a unified source of parameter configuration.

Thanks!


-- 
Jordan Deitch
https://id.rsa.pub/



Re: [PATCH] Pass COPT and PROFILE to CXXFLAGS as well

2018-11-13 Thread Andres Freund
Hi,

On 2018-11-13 11:40:05 +0100, Christoph Berg wrote:
> While working on making extension modules built reproducibly, I
> noticed that extra flags passed via COPT (notably -ffile-prefix-map)
> do not get added to CXXFLAGS.

PROFILE I can see, but COPT I'm less sure. The name suggests it's about
C not C++. How about adding CXXOPT?

Secondary question:  Why aren't you using CFLAGS/CXXFLAGS for this?

Greetings,

Andres Freund



Re: Fixing findDependentObjects()'s dependency on scan order (regressions in DROP diagnostic messages)

2018-11-13 Thread Peter Geoghegan
On Mon, Nov 5, 2018 at 7:46 PM Andrey Lepikhov
 wrote:
> Problem No. 1 will be amplified with new asynchronous operations,
> background workers and distributing query execution. It is not problem
> of DBMS. The solution is change the tests: include sorting of query
> results, sorting of system messages before diff operation.
> If steady order of messages is critical for users we can sort
> targetObjects array in the begin of reportDependentObjects() routine by
> classId, objectId and objectSubId fields.

Yeah, maybe. I'm not that worried about the order of objects; we can
probably continue to get away with suppressing the list of objects by
placing a "\set VERBOSITY terse" where needed -- that's something
we've already been doing for some time [1]. I accept that it would be
better to sort the output, but I'm concerned that that would be a
difficult, risky project. What if there was a huge number of dependent
objects? What if a memory allocation fails?

> Problem No. 2: we can suppress some output optimizations in
> object_address_present_add_flags() routine and print all deleted objects.

If there is anything that makes it necessary to sort, it's this -- the
order of visitation can affect whether or not
object_address_present_add_flags() suppresses redundant entries. But I
still prefer to fix the problem by changing the scan order to be what
we actually want it to be.

> Problem No. 3: I suppose we can go one of two ways:
> a) print all depended objects returned by scan of DependDependerIndexId
> relation, not only the first.
> b) search a root of dependence and print only it.

A solution does occur to me that I'm kind of embarrassed to suggest,
but that would nonetheless probably do the job:

The general problem here is that the scan order of a scan that uses
pg_depend_depender_index doesn't reliably give us the order we
actually want among duplicate index entries (at least, you could
choose to characterize it in that narrow way). The index
pg_depend_depender_index looks like this:

  Column  │  Type   │ Key? │ Definition
──┼─┼──┼
 classid  │ oid │ yes  │ classid
 objid│ oid │ yes  │ objid
 objsubid │ integer │ yes  │ objsubid
btree, for table "pg_catalog.pg_depend"

Note that this isn't a unique index -- there are no unique indexes on
pg_depend at all, in fact. (Note also that the same observations apply
to pg_shdepend.)

The objsubid (not refobjsubid) is 0 in all cases that we have problems
with -- we're never going to have a problem with multiple
dependent-object-is-column entries that have the same '(classid,
objid, objsubid)' value, as far as I can tell:

pg@regression[2201]=# SELECT count(*), classid, objid, objsubid FROM
pg_depend WHERE objsubid != 0 GROUP BY classid, objid, objsubid HAVING
count(*) > 1;
 count │ classid │ objid │ objsubid
───┼─┼───┼──
(0 rows)

(i.e. We're only having problems with some of the entries/values that
you'll see when the above query is changed to have "objsubid = 0" in
its WHERE clause.)

Why not vary the objsubid value among entries that don't use it
anyway, so that they have negative integer objsubid values rather
than 0? That would have the effect of forcing the scan order of
pg_depend_depender_index scans to be whatever we want it to be.
dependent-object-is-column entries would not have their sort order
affected anyway, because in practice there is only one entry with the
same  '(classid, objid, objsubid)' value when objsubid > 0.

We could invent some scheme that made pg_depend/pg_shdepend entries
use fixed objsubid negative integer codes according to the
*referenced* object type (refobjid), and/or the entry's deptype.
Places that look at ObjectAddress.objectSubId would look for entries
that were > 0 rather than != 0, so we wouldn't really be changing the
meaning of objsubid (though it would need to be documented in the user
docs, and have a release note compatibility entry).
findDependentObjects() would still use "nkeys = 2" for these entries;
it would be one of the places that would be taught to check
objectSubId > 0 rather than objectSubId != 0. But it would get the
scan order it actually needs.
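
To make that concrete, here is roughly where the changed test would sit --
purely a sketch of the idea, not a worked patch, and the surrounding control
flow is simplified compared to the real findDependentObjects():

ScanKeyData key[2];
SysScanDesc scan;
HeapTuple	tup;

ScanKeyInit(&key[0],
			Anum_pg_depend_classid,
			BTEqualStrategyNumber, F_OIDEQ,
			ObjectIdGetDatum(object->classId));
ScanKeyInit(&key[1],
			Anum_pg_depend_objid,
			BTEqualStrategyNumber, F_OIDEQ,
			ObjectIdGetDatum(object->objectId));

/* still nkeys = 2 for whole-object entries, as today */
scan = systable_beginscan(depRel, DependDependerIndexId, true,
						  NULL, 2, key);

while (HeapTupleIsValid(tup = systable_getnext(scan)))
{
	Form_pg_depend foundDep = (Form_pg_depend) GETSTRUCT(tup);

	/*
	 * Places that currently test foundDep->objsubid != 0 would test
	 * foundDep->objsubid > 0 instead: negative values are whole-object
	 * entries whose position in the index encodes referenced-object type
	 * or deptype, which is what forces the scan order we want.
	 */

	/* ... existing dependency handling, unchanged ... */
}
systable_endscan(scan);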

I haven't attempted to put this together just yet, because I want to
see if it passes the laugh test. My sense is that the best way to fix
the problem is to force the scan order to be the one that we
accidentally depend on using some mechanism or other. Though not
necessarily the one I've sketched out.

[1] https://postgr.es/m/24279.1525401...@sss.pgh.pa.us
--
Peter Geoghegan



Re: Restore CurrentUserId only if 'prevUser' is valid when abort transaction

2018-11-13 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Nov 12, 2018 at 08:17:29PM -0500, Tom Lane wrote:
>> I think the real bug here is that a bunch of potentially-failable
>> operations have been thrown in before we've finished initializing the
>> TransactionState to minimal sanity.  (Inserting code with the aid of a
>> dartboard seems to be a chronic disease around here :-(.)

> When first working on the patch I got to wonder if there were any
> intermediate states which relied on the user ID of the security context
> flags which could have justified its current position.  Just checking
> now it looks safe to move up the call.  I have checked as well my test
> cases injecting errors.  What do you think about the attached? 

Looks sane to me.

> Also, I think that we should backpatch something all the way down.

Yes.

>> I'd be strongly inclined to change the elog(WARNING) at line 1815
>> to an assertion, because calling elog exposes us to all kinds of
>> hazards that we don't need here.

> No objections from here.  I would do that only on HEAD though.

Well, if it is an issue then it's an issue for back branches too.
It's fine, I think, as long as the warning stays a warning ...
but there are lots of ways in which trying to print a warning
might turn into an error.

regards, tom lane



Re: wal_dump output on CREATE DATABASE

2018-11-13 Thread Tomas Vondra
On Tue, 2018-11-13 at 18:53 +0100, Jean-Christophe Arnu wrote:
> On Mon, Nov 5, 2018 at 3:37 PM, Jean-Christophe Arnu
> wrote:
> 
> > 
> > 
> > On Sun, Nov 4, 2018 at 6:01 PM, Jean-Christophe Arnu
> > wrote:
> > 
> > > On Fri, Nov 2, 2018 at 8:37 AM, Peter Eisentraut <
> > > peter.eisentr...@2ndquadrant.com> wrote:
> > > 
> > > > On 26/10/2018 15:53, Jean-Christophe Arnu wrote:
> > > > > Exemple on CREATE DATABASE (without defining a template
> > > > > database) :
> > > > > rmgr: Database len (rec/tot): 42/42,
> > > > > tx: 568, lsn:
> > > > > 0/01865790, prev 0/01865720, desc: CREATE copy dir 1/1663 to
> > > > > 16384/1663
> > > > > 
> > > > > It comes out (to me) it may be more consistent to use the
> > > > > same template
> > > > > than the other operations in pg_waldump.
> > > > > I propose to swap the parameters positions for the copy dir
> > > > > operation
> > > > > output.
> > > > > 
> > > > > You'll find a patch file included which does the switching
> > > > > job :
> > > > > rmgr: Database len (rec/tot): 42/42,
> > > > > tx: 568, lsn:
> > > > > 0/01865790, prev 0/01865720, desc: CREATE copy dir 1663/1 to
> > > > > 1663/16384
> > > > 
> > > > I see your point.  I suspect this was mainly implemented that
> > > > way
> > > > because that's how the fields occur in the underlying
> > > > xl_dbase_create_rec structure.  (But that structure also has
> > > > the target
> > > > location first, so it's not entirely consistent that way
> > > > either.)  We
> > > > could switch the fields around in the output, but I wonder
> > > > whether we
> > > > couldn't make the whole thing a bit more readable like this:
> > > > 
> > > > desc: CREATE copy dir ts=1663 db=1 to ts=1663 db=16384
> > > > 
> > > > or maybe like this
> > > > 
> > > > desc: CREATE copy dir (ts/db) 1663/1 to 1663/16384
> > > > 
> > > 

I don't know, but this feels like a step too far to me.

The patch started with the goal of making the CREATE DATABASE format
more consistent w.r.t. to other pg_waldump output, and this seems more
like redesigning the whole format with no clear use case.

People reading pg_waldump output quickly learn to read the A/B/C format
and what those fields mean. Breaking that into ts=A db=B relfilenode=C
does not make that particularly clearer or easier to read. I'd say it'd
also make it harder to parse, and it increases the size of the output
(both in terms of line length and data size).

So -1 to this from me. I'd just swap the fields in the copy dir message
and call it a day.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: wal_dump output on CREATE DATABASE

2018-11-13 Thread Tomas Vondra
On Tue, 2018-11-13 at 12:24 -0800, Andres Freund wrote:
> Hi,
> 
> On 2018-11-13 17:18:32 -0300, Alvaro Herrera wrote:
> > On 2018-Nov-13, Peter Eisentraut wrote:
> > 
> > > On 13/11/2018 18:53, Jean-Christophe Arnu wrote:
> > > > May I request any update on my (not so important) proposal ?
> > > 
> > > Other opinions out there perhaps?  I think your proposals are
> > > good, but
> > > I'm not sure to what extent people are automatically parsing the
> > > pg_waldump output and are relying on a particular fixed format.
> 
> I personally don't care either way about this change. But:
> 
> > While I've used pg_waldump a number of times to investigate various
> > problems, I have never written a script that would be broken by the
> > proposed changes.  It seems the requirements vary every time.
> 
> I'm not too concerned either. There's plenty of change in the WAL
> contents across versions anyway. We shouldn't break things
> gratuitously, but we shouldn't hesitate to make the format better
> either.
> 
> If somebody really wanted something that parses reasonably across
> versions I'd like to a) hear that usecase b) come up with an output
> format that's oriented towards that.
> 

+1 to that

I do process pg_waldump output quite often, and IMHO it's more valuable
to produce consistently formatted output (for a given version) than to try
to maintain cross-version compatibility at all costs.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: wal_dump output on CREATE DATABASE

2018-11-13 Thread Andres Freund
Hi,

On 2018-11-13 17:18:32 -0300, Alvaro Herrera wrote:
> On 2018-Nov-13, Peter Eisentraut wrote:
> 
> > On 13/11/2018 18:53, Jean-Christophe Arnu wrote:
> > > May I request any update on my (not so important) proposal ?
> > 
> > Other opinions out there perhaps?  I think your proposals are good, but
> > I'm not sure to what extent people are automatically parsing the
> > pg_waldump output and are relying on a particular fixed format.

I personally don't care either way about this change. But:

> While I've used pg_waldump a number of times to investigate various
> problems, I have never written a script that would be broken by the
> proposed changes.  It seems the requirements vary every time.

I'm not too concerned either. There's plenty of change in the WAL contents
across versions anyway. We shouldn't break things gratuitously, but we
shouldn't hesitate to make the format better either.

If somebody really wanted something that parses reasonably across
versions I'd like to a) hear that usecase b) come up with an output
format that's oriented towards that.

Greetings,

Andres Freund



Re: wal_dump output on CREATE DATABASE

2018-11-13 Thread Alvaro Herrera
On 2018-Nov-13, Peter Eisentraut wrote:

> On 13/11/2018 18:53, Jean-Christophe Arnu wrote:
> > May I request any update on my (not so important) proposal ?
> 
> Other opinions out there perhaps?  I think your proposals are good, but
> I'm not sure to what extent people are automatically parsing the
> pg_waldump output and are relying on a particular fixed format.

While I've used pg_waldump a number of times to investigate various
problems, I have never written a script that would be broken by the
proposed changes.  It seems the requirements vary every time.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: wal_dump output on CREATE DATABASE

2018-11-13 Thread Peter Eisentraut
On 13/11/2018 18:53, Jean-Christophe Arnu wrote:
> May I request any update on my (not so important) proposal ?

Other opinions out there perhaps?  I think your proposals are good, but
I'm not sure to what extent people are automatically parsing the
pg_waldump output and are relying on a particular fixed format.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: proposal: simple query profile and tracing API

2018-11-13 Thread Tomas Vondra
On Tue, 2018-11-13 at 13:55 +0100, Pavel Stehule wrote:
> út 13. 11. 2018 v 13:12 odesílatel legrand legrand <
> legrand_legr...@hotmail.com> napsal:
> 
> > Hello Pavel,
> > 
> > What about using wait events and a trigger on pg_stat_activity ?
> > 
> 
> pg_stat_activity may not show fresh data. Using
> pg_stat_activity can be too expensive for fast queries
> 

More importantly, how would you create a trigger on pg_stat_activity,
considering it's a system view backed by SRF?

> > ...
> > An other solution: a customized version of pgsentinel (for high
> > frequency sampling):
> > 
> 
> I don't believe in the sampling method - I am talking about queries under
> 10ms; I would want to see 2-3ms planning time, 2-5ms waits - and
> that means sampling about every 2ms, which is expensive
> 

You're quietly assuming that whatever alternative solution you end up
inventing will be cheaper than this sampling. Which is going to be
hard, if you want to do that for every execution of even the shortest
queries. I'd say that's doomed to fail.

Moreover, the sampling does not need to catch every query execution.
The idea is to do it "just often enough" for some desired accuracy. For
example you might pick 10ms interval - it will hit even shorter queries
if they are executed often enough (and if they're not, who cares about
them?). And given the sample percentages and total time, you can do
some estimates for each query / phase.
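
(The arithmetic behind those estimates is the usual sampling estimate; a
trivial sketch, with the function name and parameters made up for this
example:)

/*
 * Estimate the time a query spent in a given phase/wait from samples: the
 * fraction of samples seen in that phase approximates the fraction of
 * wall-clock time spent there.
 */
static double
estimate_phase_time_ms(uint64 samples_in_phase, uint64 total_samples,
					   double total_query_time_ms)
{
	if (total_samples == 0)
		return 0.0;
	return total_query_time_ms *
		((double) samples_in_phase / (double) total_samples);
}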


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Minor fail in realfailN scanner rules

2018-11-13 Thread Tom Lane
While fooling with John Naylor's ecpg lexer sync patch, my attention
was drawn to the "realfail1" flex rule, which triggers when we see
digits immediately followed by "e", but no exponent after that:

{realfail1}{
/*
 * throw back the [Ee], and treat as {decimal}.  Note
 * that it is possible the input is actually {integer},
 * but since this case will almost certainly lead to a
 * syntax error anyway, we don't bother to distinguish.
 */
yyless(yyleng - 1);
SET_YYLLOC();
yylval->str = pstrdup(yytext);
return FCONST;
}

I think that code and comment are mine, but I realized that it's overly
optimistic to suppose that the situation can't happen.  Consider

SELECT 42efoo, 45 ebar;
 efoo | ebar 
--+--
   42 |   45
(1 row)

The first target item is lexed as FCONST then IDENT because of what
realfail1 has done, while the second one is lexed as ICONST then IDENT.

This is not great.  It happens to work anyway -- that is, the first
column is deemed to be int4 not numeric -- because make_const() is very
paranoid about what it might find in a T_Float constant.  But it might
well be that there are other syntactic contexts in which returning FCONST
would result in parse errors or unexpected behavior.

Fortunately, this doesn't really take any extra code to fix; we can
do something like the attached.  psql and ecpg should be corrected
to match, although it's certainly just cosmetic for psql, and probably
also for ecpg.

regards, tom lane

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index 6c6a6e3..74e34df 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -1005,22 +1005,18 @@ other			.
 }
 {realfail1}		{
 	/*
-	 * throw back the [Ee], and treat as {decimal}.  Note
-	 * that it is possible the input is actually {integer},
-	 * but since this case will almost certainly lead to a
-	 * syntax error anyway, we don't bother to distinguish.
+	 * throw back the [Ee], and figure out whether what
+	 * remains is an {integer} or {decimal}.
 	 */
 	yyless(yyleng - 1);
 	SET_YYLLOC();
-	yylval->str = pstrdup(yytext);
-	return FCONST;
+	return process_integer_literal(yytext, yylval);
 }
 {realfail2}		{
 	/* throw back the [Ee][+-], and proceed as above */
 	yyless(yyleng - 2);
 	SET_YYLLOC();
-	yylval->str = pstrdup(yytext);
-	return FCONST;
+	return process_integer_literal(yytext, yylval);
 }
 
 
@@ -1255,6 +1251,10 @@ litbufdup(core_yyscan_t yyscanner)
 	return new;
 }
 
+/*
+ * Process {integer}.  Note this will also do the right thing with {decimal},
+ * ie digits and a decimal point.
+ */
 static int
 process_integer_literal(const char *token, YYSTYPE *lval)
 {
@@ -1265,7 +1265,7 @@ process_integer_literal(const char *token, YYSTYPE *lval)
 	val = strtoint(token, &endptr, 10);
 	if (*endptr != '\0' || errno == ERANGE)
 	{
-		/* integer too large, treat it as a float */
+		/* integer too large (or contains decimal pt), treat it as a float */
 		lval->str = pstrdup(token);
 		return FCONST;
 	}


Re: Sync ECPG scanner with core

2018-11-13 Thread Tom Lane
Alvaro Herrera  writes:
> On 2018-Nov-13, Tom Lane wrote:
>> More generally, I wonder if it'd be practical to make pgc.l backup-free
>> altogether.  I'm not sure that the resulting gain in lexing speed would
>> really be of interest to anyone,

> Maybe time to compile the ecpg test cases during "make check-world"?

I'm dubious that the lexer is a significant part of that, though I could
be wrong ...

regards, tom lane



Re: Sync ECPG scanner with core

2018-11-13 Thread Alvaro Herrera
On 2018-Nov-13, Tom Lane wrote:

> More generally, I wonder if it'd be practical to make pgc.l backup-free
> altogether.  I'm not sure that the resulting gain in lexing speed would
> really be of interest to anyone,

Maybe time to compile the ecpg test cases during "make check-world"?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Refactoring the checkpointer's fsync request queue

2018-11-13 Thread Robert Haas
On Tue, Nov 13, 2018 at 1:07 PM Andres Freund  wrote:
> On 2018-11-13 12:04:23 -0500, Robert Haas wrote:
> > I still feel like this whole pass-the-fds-to-the-checkpointer thing is
> > a bit of a fool's errand, though.  I mean, there's no guarantee that
> > the first FD that gets passed to the checkpointer is the first one
> > opened, or even the first one written, is there?
> I'm not sure I understand the danger you're seeing here. It doesn't have
> to be the first fd opened, it has to be an fd that's older than all the
> writes that we need to ensure made it to disk. And that ought to be
> guaranteed by the logic?  Between the FileWrite() and the
> register_dirty_segment() (and other relevant paths) the FD cannot be
> closed.

Suppose backend A and backend B open a segment around the same time.
Is it possible that backend A does a write before backend B, but
backend B's copy of the fd reaches the checkpointer before backend A's
copy?  If you send the FD to the checkpointer before writing anything
then I think it's fine, but if you write first and then send the FD to
the checkpointer I don't see what guarantees the ordering.

> > It seems like if you wanted to make this work reliably, you'd need to
> > do it the other way around: have the checkpointer (or some other
> > background process) open all the FDs, and anybody else who wants to
> > have one open get it from the checkpointer.
>
> That'd require a process context switch for each FD opened, which seems
> clearly like a no-go?

I don't know how bad that would be.  But hey, no cost is too great to
pay as a workaround for insane kernel semantics, right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Sync ECPG scanner with core

2018-11-13 Thread Tom Lane
John Naylor  writes:
> Commit 21c8ee7946 was one of several that sync'd the backend scanner
> with the psql scanner, with the commit message stating "Someday it'd
> be nice to make ecpg's pgc.l more easily diff'able too, but today is
> not that day." Attached is a set of patches to make progress towards
> that. The ECPG tests pass. There's probably more that could be done,
> but this seems like a good base to start from.

Looks like good stuff, pushed with a small additional amount of work.

> -base_yyerror() is used once, but every other call site uses
> mmerror(), so use that instead.

Yeah, I think this is a bug, though minor.  The other places that
deal with unexpected EOF use mmfatal(PARSE_ERROR, ...) though,
so I did it like that.

> In the course of doing this, I noticed a few other possible ECPG
> scanner cleanups to consider, although the payoff might not justify
> the work involved:

> -Copy process_integer_literal() from the core scanner (maybe only if
> decimal_fail is also brought over).

I went ahead and did both parts of that, mainly because as things
stood the ecpg lexer would produce a different token sequence for
"1..10" than the core would.  I don't think this really matters
right at the moment, but if for instance ecpg decided to insert
spaces where it thought the token boundaries were, it would matter.

> -Match core's handling of unicode escapes, even though pgc.l is not 
> backup-free.

Yeah, the amount of remaining diffs due to the omission of the anti-backup
rules is a bit annoying and confusing.

More generally, I wonder if it'd be practical to make pgc.l backup-free
altogether.  I'm not sure that the resulting gain in lexing speed would
really be of interest to anyone, but just in terms of using uniform lexer
design rules, it might be worthwhile.  I tried a run with -b just to see,
and indeed there are a bunch of backup cases in the ECPG-specific rules,
so it'd take some work.

> -Does ECPG still need its own functions for mm_alloc() and
> mm_strdup(), if we have equivalents in src/common?

The error handling isn't the same, so I don't think we can just replace
them with the src/common functions.  It is, however, potentially worth
our trouble to fix things so that within pgc.l, palloc() and pstrdup()
are macros for mm_alloc/mm_strdup, thereby further reducing the diffs
to the core lexer.
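
(Concretely, something as small as this near the top of pgc.l would do it --
a sketch only:)

/* route core-lexer allocation calls to ecpg's error-handling allocators */
#undef palloc
#undef pstrdup
#define palloc(sz)		mm_alloc(sz)
#define pstrdup(str)	mm_strdup(str)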

I'd be pretty tempted to also change all the hard references to
base_yylval to go through a pointer variable, so that diffs like
this go away:

-yylval->str = pstrdup(yytext);
+base_yylval.str = mm_strdup(yytext);

I believe that that'd result in more compact code, too --- on most
machines, references to global variables are more expensive in
code bytes than indirecting through a local pointer variable.

> -Use reentrant APIs?

Hm.  There's no functional value in that for ecpg, but maybe worth doing
anyway to get rid of diffs like

-addlit(yytext, yyleng, yyscanner);
+addlit(yytext, yyleng);

Guess it depends on how nitpicky you're feeling.  In principle,
reducing these remaining diffs would ease future maintenance of
the lexers, but they're not something we change often.

regards, tom lane



Re: Race condition in WaitForBackgroundWorkerStartup

2018-11-13 Thread Tomas Vondra
On Tue, 2018-11-13 at 07:57 -0600, Jeremy Finzel wrote:
> ...
>
> I'm unclear on what counts as "worker initialization".  The error is
> happening in the worker_spi_main function, not in the
> worker_spi_launch function.  So does an immediate error in
> worker_spi_main count as part of the worker initialization?
> 

I don't think it does (or should). worker_spi_main is pretty much the
worker body, and worker initialization pretty much means just "we've
initialized the worker process (including tracking it in shmem etc.)
and invoked its main function".

I'd also argue the behavior is expected to depend on the machine to
some extent - depending on speed and load the machine may hit the error
before/after the GetBackgroundWorkerPid call, producing different
results.

So I'd say the behavior seems correct, at least from this POV.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Refactoring the checkpointer's fsync request queue

2018-11-13 Thread Andres Freund
Hi,

On 2018-11-13 12:04:23 -0500, Robert Haas wrote:
> I still feel like this whole pass-the-fds-to-the-checkpointer thing is
> a bit of a fool's errand, though.  I mean, there's no guarantee that
> the first FD that gets passed to the checkpointer is the first one
> opened, or even the first one written, is there?

I'm not sure I understand the danger you're seeing here. It doesn't have
to be the first fd opened, it has to be an fd that's older than all the
writes that we need to ensure made it to disk. And that ought to be
guaranteed by the logic?  Between the FileWrite() and the
register_dirty_segment() (and other relevant paths) the FD cannot be
closed.


> It seems like if you wanted to make this work reliably, you'd need to
> do it the other way around: have the checkpointer (or some other
> background process) open all the FDs, and anybody else who wants to
> have one open get it from the checkpointer.

That'd require a process context switch for each FD opened, which seems
clearly like a no-go?

Greetings,

Andres Freund



Re: Refactoring the checkpointer's fsync request queue

2018-11-13 Thread Andres Freund
Hi,

On 2018-11-12 15:58:41 +1300, Thomas Munro wrote:
> There is one major problem with this patch:  BufferSync(), run in the
> checkpointer, can deadlock against a backend that holds a buffer lock
> and is blocked in SendFsyncRequest().  To break this deadlock, we need
> way out of it on either the sending or receiving side.  Here are three
> ideas:

That's the deadlock I'd mentioned in Pune (?) btw.


> 1.  Go back to the current pressure-valve strategy: make the sending
> side perform the fsync(), if it can't immediately write to the pipe.

I don't think that's correct / safe?  I've previously wondered whether
there's any way we could delay the write to a point where the buffer is
not locked anymore - as far as I can tell it's actually not required for
correctness that we send the fsync request before unlocking.  It's
architecturally a bit dicey tho :(

Greetings,

Andres Freund



Re: wal_dump output on CREATE DATABASE

2018-11-13 Thread Jean-Christophe Arnu
On Mon, Nov 5, 2018 at 3:37 PM, Jean-Christophe Arnu
wrote:

>
>
> On Sun, Nov 4, 2018 at 6:01 PM, Jean-Christophe Arnu
> wrote:
>
>> On Fri, Nov 2, 2018 at 8:37 AM, Peter Eisentraut <
>> peter.eisentr...@2ndquadrant.com> wrote:
>>
>>> On 26/10/2018 15:53, Jean-Christophe Arnu wrote:
>>> > Exemple on CREATE DATABASE (without defining a template database) :
>>> > rmgr: Database len (rec/tot): 42/42, tx: 568, lsn:
>>> > 0/01865790, prev 0/01865720, desc: CREATE copy dir 1/1663 to 16384/1663
>>> >
>>> > It comes out (to me) it may be more consistent to use the same template
>>> > than the other operations in pg_waldump.
>>> > I propose to swap the parameters positions for the copy dir operation
>>> > output.
>>> >
>>> > You'll find a patch file included which does the switching job :
>>> > rmgr: Database len (rec/tot): 42/42, tx: 568, lsn:
>>> > 0/01865790, prev 0/01865720, desc: CREATE copy dir 1663/1 to 1663/16384
>>>
>>> I see your point.  I suspect this was mainly implemented that way
>>> because that's how the fields occur in the underlying
>>> xl_dbase_create_rec structure.  (But that structure also has the target
>>> location first, so it's not entirely consistent that way either.)  We
>>> could switch the fields around in the output, but I wonder whether we
>>> couldn't make the whole thing a bit more readable like this:
>>>
>>> desc: CREATE copy dir ts=1663 db=1 to ts=1663 db=16384
>>>
>>> or maybe like this
>>>
>>> desc: CREATE copy dir (ts/db) 1663/1 to 1663/16384
>>>
>>
>>
>> Thank you Peter for your review and proposal .
>> I like the last one which gives the fields semantics in a concise way.
>> Pushing the idea a bit farther we could think of applying that
>> description to any other operation that involves the ts/db/oid fields. What
>> do you think ?
>>
>> Example in nbtdesc.c we could add the "(ts/db/oid)" information to help
>> the DBA to identify the objects:
>> case XLOG_BTREE_REUSE_PAGE:
>> {
>> xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) rec;
>>
>> appendStringInfo(buf, "rel (ts/db/rel) %u/%u/%u;
>> latestRemovedXid %u",
>>  xlrec->node.spcNode, xlrec->node.dbNode,
>>  xlrec->node.relNode,
>> xlrec->latestRemovedXid);
>> break;
>> }
>>
>> This may help DBAs to better identify the objects related to the
>> messages.
>>
>> Any thought/comments/suggestions ?
>>
>> ~~~ Side story
>> Well to be clear, my first proposal is born while i was writting a side
>> script to textually identify which objects were involved into pg_waldump
>> operations (if that objects already exists at script run time). I'm
>> wondering whether it could be interesting to incllde such a "textual
>> decoding" into pg_waldump or not (as a command line switch). Anyway, my
>> script would be available as proof of concept first. We have time to
>> discuss that last point in another thread.
>> ~~~
>>
>> Thank you
>>
>>>
> I've dug a little more in that way and spotted different locations in the
> code where such semantics might be useful too.
> Sometimes just like in XLOG_HEAP_TRUNCATE (heapdesc.c), there's a bunch of
> rels and descendants that are held by a single db. Adding the database id
> may be useful in that case. Decoding tablespace for each object may be
> interesting  (but not mandatory IMHO) for each rel. That information does
> not seem to be included in the structure, but as a newcomer I assume there's
> a convenient way to retrieve that information easily.
>
> Another operation that might be eligible to end user message improvements
> is the one handled by xact_desc_commit() function (xactdesc.c) . Each time
> a COMMIT (XLOG_XACT_COMMIT or XLOG_XACT_COMMIT_PREPARED)  occurs the
> folllowing snippets is output :
>
>  desc: COMMIT 2018-11-05 15:11:03.087546 CET; rels: base/16384/16385
> base/16384/16388 base/16384/16390;
>
> in that case, file path is output using relpathperm macro which ends up
> calling the GetRelationPath() function.  In that last function, we can have
> the dbNode, spcNode (tablespace) and relNode Oid (include/common/relpath.h )
>
> The question is now to know if it would be interesting to have a
> consistent way to represent all objects and their hierarchy :
>  BTW, we could have db/ts/rel or ts/db/rel ?
> What about the "base/ts/rel" from XLOG_XACT_COMMIT/COMMIT_PREPARED ? Why
> is it different from the other representation (it must serve a purpose I
> assume) ?
> How do we represent multiple rels such as XLOG_HEAP_TRUNCATE? My
> proposal (db/rels) dboid/ reloid1 reloid2 reloid3 ... reloidN  (as TRUNCATE
> only deals with one DB, but no tablespace is defined, this may be another
> point ?)
>
> Well, if you could give me some directions, I would appreciate !
>
> As ever, any thought, comments are more than welcomed.
>

Hello,
May I request any update on my (not so important) proposal ?

Thank you!
-- 

Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

2018-11-13 Thread Tomas Vondra
On Mon, 2018-11-12 at 11:33 -0500, Tom Lane wrote:
> Alvaro Herrera  writes:
> > One of the guiding principles that I think we should hold for
> > partitioning is that operating directly into the partition should
> > be
> > seen as only an optimization compared to inserting into the parent
> > table
> > -- thus it should not behave differently.  Applying different
> > default
> > values depending on where you're inserting into goes counter to
> > that
> > principle.
> 
> I'm not entirely convinced that I buy that argument, especially not
> in a case like this where it introduces logical inconsistencies where
> there otherwise wouldn't be any.
> 

Not sure what new logical inconsistencies you have in mind, but the
principle proposed by Alvaro seems sensible to me. I too see insertions
into the partition primarily as an optimization, and would not expect
it to use other defaults than when inserting through the parent.

Particularly not defaults contradicting the partition definition and
forcing rerouting the tuple somewhere else automatically. (IME when
inserting directly into partitions people expect that data to end up in
that partition, and if not it's an error - bogus data, ...).

I'm not claiming there are no use cases for defaults on partitions. For
example I can imagine multi-level partitioning, where each layer can
specify defaults for the lower levels. But it seems rather surprising
to only apply the partition defaults for direct insertions.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

2018-11-13 Thread Tomas Vondra
On Tue, 2018-11-13 at 13:44 +0100, Jürgen Strobel wrote:
> On 2018-11-13 00:57, Tom Lane wrote:
> > =?UTF-8?Q?J=c3=bcrgen_Strobel?= 
> > writes:
> > > On 2018-11-12 17:33, Tom Lane wrote:
> > > > I'm not entirely convinced that I buy that argument, especially
> > > > not in
> > > > a case like this where it introduces logical inconsistencies
> > > > where there
> > > > otherwise wouldn't be any.
> > > I think I missed something, what are the *introduced* logical
> > > problems?
> > 
> > What to do if a partition introduces a default value that would
> > force
> > re-routing to some other partition.
> 
> I would claim this is a problem which already exists with the current
> behavior:
> 

I think the question is what part of the current behavior is
intentional (albeit with corner-cases not handled or rejected
properly), and what part is just not thought through. Not sure.

> a) If a tuple is inserted through the parent it's ignored and
> useless.
> b) If a tuple is inserted through the partition it can lead to
> inconsistency and a constraint violation (possibly rectifiable by a
> re-route which I think we all agree is out of the question).
> 

Yeah, allowing defaults mismatching the partition seems like a can of
worms. We certainly should not allow such rows to be inserted in the
partition. Routing them somewhere else seems tricky, so perhaps it'd be
better to just error-out in this case.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Refactoring the checkpointer's fsync request queue

2018-11-13 Thread Robert Haas
On Sun, Nov 11, 2018 at 9:59 PM Thomas Munro
 wrote:
> There is one major problem with this patch

If there's only one, you're doing great!  Although admittedly this
seems like a big one...

> 1.  Go back to the current pressure-valve strategy: make the sending
> side perform the fsync(), if it can't immediately write to the pipe.

As you say, this will happen significantly more often with
deduplication.  That deduplication logic got added in response to a
real need.  Before that, you could cause an individual backend to
start doing its own fsyncs() with something as simple as a bulk load.
The queue would absorb most of them, but not all, and the performance
ramifications were noticeable.

> 2.  Offload the BufferSync() work to bgwriter, so the checkpointer can
> keep draining the pipe.  Communication between checkpointer and
> bgwriter can be fairly easily multiplexed with the pipe draining work.

That sounds a little like you are proposing to go back to the way
things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
the division of labor wouldn't be quite the same.

> 3.  Multiplex the checkpointer's work:  Use LWLockConditionalAcquire()
> when locking buffers, and if that fails, try to drain the pipe, and
> then fall back to a LWLockTimedAcquire(), drain pipe, repeat loop.  I
> can hear you groan already; that doesn't seem particularly elegant,
> and there are portability problems implementing LWLockTimedAcquire():
> semtimedop() and sem_timedwait() are not available on all platforms
> (eg macOS).  Maybe pthread_timed_condwait() could help (!).

You don't really need to invent LWLockTimedAcquire().  You could just
keep retrying LWLockConditionalAcquire() in a delay loop.  I agree
that doesn't seem particularly elegant, though.
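
(Roughly this shape, just to illustrate the retry loop; AbsorbFsyncRequests()
stands in for whatever the pipe-draining routine ends up being called in the
patched checkpointer, and bufHdr is the buffer being flushed:)

/*
 * Sketch only: retry the conditional acquire, draining the fsync request
 * pipe between attempts so a backend blocked on a full pipe while holding
 * this buffer's content lock cannot deadlock against us.
 */
for (;;)
{
	if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(bufHdr),
								 LW_SHARED))
		break;
	AbsorbFsyncRequests();
	pg_usleep(1000L);			/* brief backoff before retrying */
}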

I still feel like this whole pass-the-fds-to-the-checkpointer thing is
a bit of a fool's errand, though.  I mean, there's no guarantee that
the first FD that gets passed to the checkpointer is the first one
opened, or even the first one written, is there?  It seems like if you
wanted to make this work reliably, you'd need to do it the other way
around: have the checkpointer (or some other background process) open
all the FDs, and anybody else who wants to have one open get it from
the checkpointer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Changing SQL Inlining Behaviour (or...?)

2018-11-13 Thread Paul Ramsey
On Fri, Nov 9, 2018 at 11:14 AM Paul Ramsey 
wrote:

> > Neglected to include the footnotes...
>
> As I promised at PgConf.eu, here's a follow-up email about PostGIS
> parallelism and function inlining (an odd combination, it is true).
>
> So, context:
>
> - We want PostGIS to parallelize more. In order to achieve that we need to
> mark our functions with more realistic COSTs. Much much higher COSTs.
> - When we do that, we hit a different problem. Our most commonly used
> functions, ST_Intersects(), ST_DWithin() are actually SQL wrapper functions
> are more complex combinations of index operators and exact computational
> geometry functions.
> - In the presence of high cost parameters that are used multiple times in
> SQL functions, PostgreSQL will stop inlining those functions, in an attempt
> to save the costs of double-calculating the parameters.
> - For us, that's the wrong choice, because we lose the index operators at
> the same time as we "save" the cost of double calculation.
> - We need our wrapper functions inlined, even when they are carrying a
> high COST.
>
> At pgconf.eu, I canvassed this problem and some potential solutions:
>
> * Solution #1 - Quick and dirty and visible: Add an 'INLINE' function
> decorator, which tells PostgreSQL to just ignore costs and inline the
> function regardless. Pros: it's not too hard to implement and I'm happy to
> contribute this. Cons: it adds very specific single-purpose syntax to
> CREATE FUNCTION.
>
> * Solution #2 - Quick and dirty and invisible. Tom suggested a hack that
> achieves the aims of #1 but without adding syntax to CREATE FUNCTION: have
> the inlining logic look at the cost of the wrapper and the cost of
> parameters, and if the cost of the wrapper "greatly exceeded" the cost of
> the parameters, then inline. So the PostGIS project would just set the cost
> of our wrappers very high, and we'd get the behaviour we want, while other
> users who want to use wrappers to force caching of calculations would have
> zero coded wrapper functions.  Pros: Solves the problem and easy to
> implement, I'm happy to contribute. Cons: it's so clearly a hack involving
> hidden (from users) magic.
>
> * Solution #3 - Correct and globally helpful. When first presented with
> this problem last year, both Andres and Tom said [1] "but the right fix is
> to avoid the double-calculation of identical entries in the target list"
> because then it would be safe to inline functions with duplicate expensive
> parameters. This would not only address the proximate PostGIS problem but
> make a whole class of queries faster. There was some discussion of this
> approach last week [2]. Pros: The right thing! Improves a whole pile of
> other performance cases. Cons: Hard! Only experienced PgSQL developers need
> apply.
>
> Naturally, I would love to see #3 implemented, but there's only so much
> experienced developer time to go around, and it's beyond my current skill
> set. I would like to be able to start to improve PostGIS parallelism with
> PgSQL 12, so in order to make that not impossible, I'd like to implement
> either #1 or #2 in case #3 doesn't happen for PgSQL 12.
>
> So my question to hackers is: which is less worse, #1 or #2, to implement
> and submit to commitfest, in case #3 does not materialize in time for PgSQL
> 12?
>

Absent any preferences, I would be inclined to go with #2, having a high
personal tolerance for ugly hacks... :)

P



>
> [1]
> https://www.postgresql.org/message-id/20171116182208.kcvf75nfaldv36uh%40alap3.anarazel.de
> [2] https://www.postgresql.org/message-id/10355.1540926295%40sss.pgh.pa.us
>
>


Re: Make description of heap records more talkative for flags

2018-11-13 Thread Bossart, Nathan
On 11/9/18, 6:11 PM, "Michael Paquier"  wrote:
> This thread needed a new patch to answer to the previous comment from
> Nathan, and it was not marked as waiting on author.  Sorry I got
> confused with the CF status and I have missed Nathan's update, which
> mentions a good idea.  So attached is an updated patch which adds "0x"
> in front of each flag output.

Thanks for the updated patch.  It looks good to me.

Nathan



Re: shared-memory based stats collector

2018-11-13 Thread Tomas Vondra
On Mon, 2018-11-12 at 20:10 +0900, Kyotaro HORIGUCHI wrote:
> At Fri, 9 Nov 2018 14:16:31 +0100, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote in <
> 803f2d96-3b4b-f357-9a2e-45443212f...@2ndquadrant.com>
> > 
> > 
> > On 11/9/18 9:33 AM, Kyotaro HORIGUCHI wrote:
> > > Hello. This is rebased version.
> > > At Thu, 8 Nov 2018 16:06:49 +0100, Tomas Vondra
> > >  wrote in
> > > 
> > >> However a quite a few extensions in contrib seem are broken now.
> It
> > >> seems fixing it is as simple as including the new bestatus.h
> next to
> > >> pgstat.h.
> > > The additional 0009 does that.
> > > 
> > 
> > That does fix it, indeed. But the break happens in 0003, so that's
> > where the fixes should be moved - I've tried to simply apply 0009
> > right after 0003, but that does not seem to work because bestatus.h
> > does not exist at that point yet :-/
> 
> Sorry, I misunderstood you. The real reason for breaking 0003 as
> you saw was the result I just removed PG_STAT_TMP_DIR. 0005 fixes
> that later. I (half-intentionally) didin't keep soundness of the
> source tree at v8-0003 and v8-0008.
> 
> > The current split into 8 parts seems quite sensible to me, i.e.
> that's
> > how it might get committed eventually. That however means each part
> > needs to be correct on it's own (hence fixes in 0009 are a
> problem).
> 
> Thanks. I neatened up the patchset so that each individual patch
> keeps the source buildable and doesn't break program behavior.
> 

OK, thanks. I'll take a look. I also plan to do much more testing, both
for correctness and performance - it's quite a piece of functionality.

If everything goes well I'd like to get this committed by the end of
January CF (with some of the initial parts in this CF, possibly).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




spurious(?) warnings in archive recovery

2018-11-13 Thread Andrew Gierth
So while investigating a case of this warning (in
UpdateMinRecoveryPoint):

"xlog min recovery request %X/%X is past current point %X/%X"

I noticed that it is issued even in cases where we know that
minRecoveryPoint is not yet valid, for example because we're waiting to
see XLOG_BACKUP_END before declaring consistency.

But, you'd think, you shouldn't get this error because any page we
modify during recovery should have been restored from an FPI with a
suitably early LSN? For data pages that is correct, but not for VM or
(iff wal_log_hints or checksums are enabled) FSM pages.

When we replay an operation that, for example, clears a bit in the VM,
the redo code will read in that VM page from disk, and because we're not
yet consistent and because _clearing_ a VM bit is not in itself
wal-logged and doesn't result in any FPI being generated for the VM
page, it could well read a VM page that has a far-future LSN from the
point of view of replay, and dirty it, causing a later eviction to try
and do UpdateMinRecoveryPoint with that future LSN.

(I haven't investigated this aspect, but there also appears to be no
protection against torn pages in the VM when checksums are enabled? am I
missing something somewhere?)

I'm less clear on the exact mechanisms, but when wal_log_hints (or
checksums) is on, FSM pages also get LSNs, sometimes, thanks to
MarkBufferDirtyHint, and at least some code paths can also do
MarkBufferDirty on FSM pages during recovery, which would cause their
eviction with possible future LSNs as with VM pages.

This means that if you simply do an old-style base backup using
pg_start_backup/rsync/pg_stop_backup (on a sufficiently active system
and taking long enough) and then recover from it, you're likely to get a
log spammed with these errors for no very good reason.

So it seems to me that issuing this error is a bug if the conditions
described are actually harmless, while if they're not harmless, then
obviously that is a bug. So _something_ needs fixing here, but I'm not
yet sufficiently confident of my analysis to say what.

Opinions?

(as a further point, it seems to me that backupEndRequired is a rather
misleadingly named variable, since what _actually_ determines whether an
XLOG_BACKUP_END record is expected is whether backupStartPoint is set.
backupEndRequired seems to change one error message and, questionably,
one decision about whether to do crash recovery before entering archive
recovery, but nothing else.)
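
To make the question concrete, here is one possible shape of a fix -- purely a
sketch, with variable names schematic rather than exactly what
UpdateMinRecoveryPoint() uses:

/*
 * Sketch only: don't complain about a min recovery request beyond the
 * current replay point until we have reached consistency; before that
 * (e.g. while still waiting for XLOG_BACKUP_END) VM/FSM pages restored
 * from the backup can legitimately carry future LSNs.
 */
if (newMinRecoveryPoint < lsn && reachedConsistency)
	elog(WARNING,
		 "xlog min recovery request %X/%X is past current point %X/%X",
		 (uint32) (lsn >> 32), (uint32) lsn,
		 (uint32) (newMinRecoveryPoint >> 32), (uint32) newMinRecoveryPoint);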

-- 
Andrew (irc:RhodiumToad)



Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-11-13 Thread Jesper Pedersen

Hi,

On 11/12/18 6:17 PM, David Rowley wrote:

On 9 November 2018 at 19:18, Amit Langote  wrote:

I have a comment regarding how you chose to make
PartitionTupleRouting private.

Using the v14_to_v15 diff, I could quickly see that there are many diffs
changing PartitionTupleRouting to struct PartitionTupleRouting, but they
would be unnecessary if you had added the following in execPartition.h, as
my upthread had done.

-/* See execPartition.c for the definition. */
+/* See execPartition.c for the definitions. */
  typedef struct PartitionDispatchData *PartitionDispatch;
+typedef struct PartitionTupleRouting PartitionTupleRouting;


Okay, done that way. v16 attached.



Still passes check-world.

Best regards,
 Jesper




Re: Continue work on changes to recovery.conf API

2018-11-13 Thread Sergei Kornilov
Hello

Thank you! Here is patch update addressing your comments.

> - useless whitespace change in xlog.c
Oops, did not notice. Fixed.

> - I think we can drop logRecoveryParameters(). Users can now inspect
> those parameters using the normal GUC mechanisms. (They were all DEBUG2
> anyway, so it's not like many users will be missing this.) Then you can
> also drop RecoveryTargetActionText().
Agreed, done.

> - Why introduce MAXRESTOREPOINTNAMELEN? If you think this is useful,
> then we could do it as a separate patch.
Reverted back. This was changed in prev proposal.

> - Make the help text change in pg_archivecleanup.c similar to pg_standby.c.
Changed.

> - In pg_basebackup.c, duplicate defines PG_AUTOCONF_FILENAME and
> STANDBY_SIGNAL_FILE. See that you can put those into a header file
> somewhere.
I moved the constants from xlog.h to xlog_internal.h. I also reverted 
RECOVERY_COMMAND_FILE and RECOVERY_COMMAND_DONE back into xlog.c (they were 
moved to xlog.h a few weeks ago).
But I have no good idea for PG_AUTOCONF_FILENAME. It seems most src/bin/ 
applications use hardcoded file paths. How about miscadmin.h?
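
(Concretely, what would be shared is just the two file-name constants, e.g. --
a sketch only, with the final header location still to be decided:)

#define PG_AUTOCONF_FILENAME	"postgresql.auto.conf"
#define STANDBY_SIGNAL_FILE		"standby.signal"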

> - This stuff breaks translation strings:
>
> printf(_(" -R, --write-recovery-conf\n"
> - " write recovery.conf for
> replication\n"));
> + " append replication config to "
> PG_AUTOCONF_FILENAME "\n"
> + " and place " STANDBY_SIGNAL_FILE "
> file\n"));
>
> Use %s placeholders, or better yet replace it in a more compact way.
Maybe leave it as just "write configuration for replication", with the 
explanation in the docs, as before?

> - The test function $node_standby->request_standby_mode() could use a
> better name. How about set_ instead of request_ ?
Sounds good, changed.

> - New string GUCs should have "" instead of NULL as their default in
> guc.c. (Not sure whether that is strictly necessary, but it seems good
> to be consistent.)
> - Help strings in guc.c should end with a period.
Fixed

> - Renaming the promote and fallback_promote files seems unnecessary for
> this patch, and I would take it out. Otherwise, the change in pg_ctl.c
> is out of date with the comment above it.
Agreed, revert back.

regards, Sergeidiff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c
index ee1fbd7..946239c 100644
--- a/contrib/pg_standby/pg_standby.c
+++ b/contrib/pg_standby/pg_standby.c
@@ -611,7 +611,7 @@ usage(void)
 	printf("  -w MAXWAITTIME max seconds to wait for a file (0=no limit) (default=0)\n");
 	printf("  -?, --help show this help, then exit\n");
 	printf("\n"
-		   "Main intended use as restore_command in recovery.conf:\n"
+		   "Main intended use as restore_command in postgresql.conf:\n"
 		   "  restore_command = 'pg_standby [OPTION]... ARCHIVELOCATION %%f %%p %%r'\n"
 		   "e.g.\n"
 		   "  restore_command = 'pg_standby /mnt/server/archiverdir %%f %%p %%r'\n");
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 3fa5efd..780f40d 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1220,7 +1220,7 @@ SELECT pg_stop_backup();


 
- Create a recovery command file recovery.conf in the cluster
+ Create a file recovery.signal in the cluster
  data directory (see ). You might
  also want to temporarily modify pg_hba.conf to prevent
  ordinary users from connecting until you are sure the recovery was successful.
@@ -1232,10 +1232,9 @@ SELECT pg_stop_backup();
  proceed to read through the archived WAL files it needs.  Should the
  recovery be terminated because of an external error, the server can
  simply be restarted and it will continue recovery.  Upon completion
- of the recovery process, the server will rename
- recovery.conf to recovery.done (to prevent
- accidentally re-entering recovery mode later) and then
- commence normal database operations.
+ of the recovery process, the server will remove
+ recovery.signal (to prevent accidentally re-entering
+ recovery mode later) and then commence normal database operations.
 


@@ -1249,12 +1248,8 @@ SELECT pg_stop_backup();

 

-The key part of all this is to set up a recovery configuration file that
-describes how you want to recover and how far the recovery should
-run.  You can use recovery.conf.sample (normally
-located in the installation's share/ directory) as a
-prototype.  The one thing that you absolutely must specify in
-recovery.conf is the restore_command,
+The key part of all this is to set up a recovery configuration.
+The one thing that you absolutely must specify is the restore_command,
 which tells PostgreSQL how to retrieve archived
 WAL file segments.  Like the archive_command, this is
 a shell command string.  It can contain %f, which is
@@ -1316,7 +1311,7 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'

 If you want to recover to some previous point in time (say, right before
 the junior DBA dropped your main transaction table), 

Re: move PartitionBoundInfo creation code

2018-11-13 Thread Alvaro Herrera
On 2018-Nov-13, Michael Paquier wrote:

> On Tue, Nov 13, 2018 at 09:58:08AM -0300, Alvaro Herrera wrote:
> > "the context that was active when the function was called" is typically
> > expressed simply as "the current memory context".  Perhaps the whole
> > phrase can be reduced to "The object returned by this function is wholly
> > allocated in the current memory context"?
> 
> Yes, that looks cleaner.  I am planning to do a last lookup round on
> tomorrow morning my time before committing, so I may still tweak a
> couple of other words...

Cool.

I gave the patch a read and it looks reasonable to me.

Memory management in RelationBuildPartitionDesc is crummy, but I don't
think it's this patch's fault.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Race condition in WaitForBackgroundWorkerStartup

2018-11-13 Thread Jeremy Finzel
On Tue, Nov 13, 2018 at 6:17 AM Amit Kapila  wrote:

> On Mon, Nov 12, 2018 at 11:55 PM Jeremy Finzel  wrote:
> >
> > I believe I found a race condition in WaitForBackgroundWorkerStartup in
> the case where it encounters an ERROR during startup.  I found that
> depending on the speed of the system, it will unreliably return either
> status BGWH_STOPPED or BGWH_STARTED.  But I can reliably reproduce getting
> BGWH_STOPPED by tweaking the worker_spi.c test module.
> >
>
> Yeah, I think it is possible that you get different values in such
> cases because we consider worker status as started after we have
> launched the worker.  Now, if we get the error in the worker
> initialization, then the user can get any of those values.  I think
> this is what is happening in your case where you are saying "ERROR
> during startup".
> Am I missing something?
>

Perhaps.  What I am saying is that some machines show ERROR during startup,
and some machines don't get an error at all, return successfully, then
immediately error and die in the background, but the client is not shown
this.  The behavior isn't predictable.  However, I can get a predictable
ERROR to happen always if I put a short pause before
WaitForBackgroundWorkerStartup.

I'm unclear on what counts as "worker initialization".  The error is
happening in the worker_spi_main function, not in the worker_spi_launch
function.  So does an immediate error in worker_spi_main count as part of
the worker initialization?

Thanks!
Jeremy


proposal - plpgsql unique statement id

2018-11-13 Thread Pavel Stehule
Hi

I am trying to enhance plpgsql_check with profiler functionality, and I have
to solve how to identify every plpgsql statement across different sessions.
There is lineno, but it is not guaranteed to be unique. I propose introducing
a stmtid in every statement structure. This number would be unique within a
function.
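
To show the intended use, a minimal sketch of how a profiler (such as
plpgsql_check) could key per-statement counters on (fn_oid, stmtid) from a
PLpgSQL_plugin stmt_beg callback.  The hash table and entry struct here are
the extension's own inventions, and the sketch assumes stmtid is exposed in
the common PLpgSQL_stmt header the same way lineno is:

typedef struct ProfilerKey
{
	Oid			fn_oid;			/* function being executed */
	int			stmtid;			/* proposed per-function statement id */
} ProfilerKey;

typedef struct ProfilerEntry
{
	ProfilerKey key;
	uint64		exec_count;
} ProfilerEntry;

static HTAB *profiler_hash;		/* created elsewhere by the extension */

static void
profiler_stmt_beg(PLpgSQL_execstate *estate, PLpgSQL_stmt *stmt)
{
	ProfilerKey key;
	ProfilerEntry *entry;
	bool		found;

	key.fn_oid = estate->func->fn_oid;
	key.stmtid = stmt->stmtid;	/* stable across sessions, unlike lineno */

	entry = (ProfilerEntry *) hash_search(profiler_hash, &key,
										  HASH_ENTER, &found);
	if (!found)
		entry->exec_count = 0;
	entry->exec_count++;
}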

Comments, notes?

Regards

Pavel
diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c
index 59460d2643..64ecfdbe2d 100644
--- a/src/pl/plpgsql/src/pl_comp.c
+++ b/src/pl/plpgsql/src/pl_comp.c
@@ -368,6 +368,8 @@ do_compile(FunctionCallInfo fcinfo,
 
 	function->fn_prokind = procStruct->prokind;
 
+	function->nstatements = 0;
+
 	/*
 	 * Initialize the compiler, particularly the namespace stack.  The
 	 * outermost namespace contains function parameters and other special
@@ -911,6 +913,8 @@ plpgsql_compile_inline(char *proc_source)
  true);
 	function->found_varno = var->dno;
 
+	function->nstatements = 0;
+
 	/*
 	 * Now parse the function's text
 	 */
@@ -1020,6 +1024,7 @@ add_dummy_return(PLpgSQL_function *function)
 
 		new = palloc0(sizeof(PLpgSQL_stmt_block));
 		new->cmd_type = PLPGSQL_STMT_BLOCK;
+		new->stmtid = function->nstatements++;
 		new->body = list_make1(function->action);
 
 		function->action = new;
@@ -1031,6 +1036,7 @@ add_dummy_return(PLpgSQL_function *function)
 
 		new = palloc0(sizeof(PLpgSQL_stmt_return));
 		new->cmd_type = PLPGSQL_STMT_RETURN;
+		new->stmtid = function->nstatements++;
 		new->expr = NULL;
 		new->retvarno = function->out_param_varno;
 
diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index 68e399f9cf..744156db2b 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -415,6 +415,7 @@ pl_block		: decl_sect K_BEGIN proc_sect exception_sect K_END opt_label
 
 		new->cmd_type	= PLPGSQL_STMT_BLOCK;
 		new->lineno		= plpgsql_location_to_lineno(@2);
+		new->stmtid 	= plpgsql_curr_compile->nstatements++;
 		new->label		= $1.label;
 		new->n_initvars = $1.n_initvars;
 		new->initvarnos = $1.initvarnos;
@@ -919,6 +920,7 @@ stmt_call		: K_CALL
 		new = palloc0(sizeof(PLpgSQL_stmt_call));
 		new->cmd_type = PLPGSQL_STMT_CALL;
 		new->lineno = plpgsql_location_to_lineno(@1);
+		new->stmtid 	= plpgsql_curr_compile->nstatements++;
 		new->expr = read_sql_stmt("CALL ");
 		new->is_call = true;
 
@@ -933,6 +935,7 @@ stmt_call		: K_CALL
 		new = palloc0(sizeof(PLpgSQL_stmt_call));
 		new->cmd_type = PLPGSQL_STMT_CALL;
 		new->lineno = plpgsql_location_to_lineno(@1);
+		new->stmtid 	= plpgsql_curr_compile->nstatements++;
 		new->expr = read_sql_stmt("DO ");
 		new->is_call = false;
 
@@ -948,6 +951,7 @@ stmt_assign		: assign_var assign_operator expr_until_semi
 		new = palloc0(sizeof(PLpgSQL_stmt_assign));
 		new->cmd_type = PLPGSQL_STMT_ASSIGN;
 		new->lineno   = plpgsql_location_to_lineno(@1);
+		new->stmtid 	= plpgsql_curr_compile->nstatements++;
 		new->varno = $1->dno;
 		new->expr  = $3;
 
@@ -963,6 +967,7 @@ stmt_getdiag	: K_GET getdiag_area_opt K_DIAGNOSTICS getdiag_list ';'
 		new = palloc0(sizeof(PLpgSQL_stmt_getdiag));
 		new->cmd_type = PLPGSQL_STMT_GETDIAG;
 		new->lineno   = plpgsql_location_to_lineno(@1);
+		new->stmtid 	= plpgsql_curr_compile->nstatements++;
 		new->is_stacked = $2;
 		new->diag_items = $4;
 
@@ -1154,6 +1159,7 @@ stmt_if			: K_IF expr_until_then proc_sect stmt_elsifs stmt_else K_END K_IF ';'
 		new = palloc0(sizeof(PLpgSQL_stmt_if));
 		new->cmd_type	= PLPGSQL_STMT_IF;
 		new->lineno		= plpgsql_location_to_lineno(@1);
+		new->stmtid 	= plpgsql_curr_compile->nstatements++;
 		new->cond		= $2;
 		new->then_body	= $3;
 		new->elsif_list = $4;
@@ -1258,6 +1264,7 @@ stmt_loop		: opt_loop_label K_LOOP loop_body
 		new = palloc0(sizeof(PLpgSQL_stmt_loop));
 		new->cmd_type = PLPGSQL_STMT_LOOP;
 		new->lineno   = plpgsql_location_to_lineno(@2);
+		new->stmtid 	= plpgsql_curr_compile->nstatements++;
 		new->label	  = $1;
 		new->body	  = $3.stmts;
 
@@ -1275,6 +1282,7 @@ stmt_while		: opt_loop_label K_WHILE expr_until_loop loop_body
 		new = palloc0(sizeof(PLpgSQL_stmt_while));
 		new->cmd_type = PLPGSQL_STMT_WHILE;
 		new->lineno   = plpgsql_location_to_lineno(@2);
+		new->stmtid 	= plpgsql_curr_compile->nstatements++;
 		new->label	  = $1;
 		new->cond	  = $3;
 		new->body	  = $4.stmts;
@@ -1295,6 +1303,7 @@ stmt_for		: opt_loop_label K_FOR for_control loop_body
 
 			new = (PLpgSQL_stmt_fori *) $3;
 			new->lineno   = plpgsql_location_to_lineno(@2);
+			new->stmtid 	= plpgsql_curr_compile->nstatements++;
 			new->label	  = $1;
 			new->body	  = $4.stmts;
 			$$ = (PLpgSQL_stmt *) new;
@@ -1309,6 +1318,7 @@ stmt_for		: opt_loop_label K_FOR for_control loop_body
 			/* forq is the common supertype of all three */
 			new = (PLpgSQL_stmt_forq *) $3;
 	

Re: move PartitionBoundInfo creation code

2018-11-13 Thread Michael Paquier
On Tue, Nov 13, 2018 at 09:58:08AM -0300, Alvaro Herrera wrote:
> "the context that was active when the function was called" is typically
> expressed simply as "the current memory context".  Perhaps the whole
> phrase can be reduced to "The object returned by this function is wholly
> allocated in the current memory context"?

Yes, that looks cleaner.  I am planning to do a last lookup round
tomorrow morning my time before committing, so I may still tweak a
couple of other words...
--
Michael


signature.asc
Description: PGP signature


Re: New function pg_stat_statements_reset_query() to reset statistics of a specific query

2018-11-13 Thread Amit Kapila
On Tue, Nov 13, 2018 at 11:32 AM Haribabu Kommi
 wrote:
>
> On Mon, Nov 12, 2018 at 6:34 PM Amit Kapila  wrote:
>> > I can revert it back to void,
>> >
>>
>> +1, as we don't see any good reason to break backward compatibility.
>
>
> Thanks for the review.
> Attached the updated patch with return type as void.
>

With this patch, we are intending to remove the entries matching
userid, dbid, queryid from the hash table (pgss_hash), but the contents of
the file (pgss_query_texts.stat) will remain unchanged unless all of the
entries are removed from the hash table.  This appears fine to me, especially
because there is no easy way to remove the contents from the file.
Does anybody see any problem with this behavior?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: doc fix for pg_stat_activity.backend_type

2018-11-13 Thread John Naylor
On 11/13/18, Amit Kapila  wrote:
> On Tue, Nov 13, 2018 at 3:37 PM John Naylor  wrote:
>>
>> On 11/13/18, Amit Kapila  wrote:
>> >
>> > Don't you need to remove background worker?
>>
>> It's handled in pgstat_get_backend_desc(), so I assumed not. If that's
>> just a place holder, then it's probably better left out, as in the
>> attached.
>>
>
> I don't think 'background worker' can be displayed as backend_type.

I think you're right (pgstatfuncs.c, starting at line 826).

-John Naylor



Re: using index or check in ALTER TABLE SET NOT NULL

2018-11-13 Thread Alvaro Herrera
On 2018-Nov-13, Dmitry Dolgov wrote:

> > On Sun, 4 Nov 2018 at 19:03, Sergei Kornilov  wrote:
> >
> > > If not properly cataloguing NOT NULL constraints would be fixed, can it
> > > potentially conflict with the current patch or not?
> > We already doing same stuff for "alter table attach partition" and in this 
> > patch i use exactly this routine. If proper cataloguing would conflict with 
> > my patch - it would conflict with "attach partition" validation too.
> > I think proper cataloguing can be implemented without conflict with 
> > proposed feature.
> 
> Yes, indeed, this patch relies on the PartConstraintImpliedByRelConstraint.
> Then maybe it makes sense to go with the solution, proposed in this thread,
> while leaving open the possibility of having "SET NOT NULL NOT VALID"? From
> the functionality point of view it definitely would be beneficial. Any other
> opinions?

I think we should ignore any SET NOT NULL NOT VALID patches, because
in practice they don't exist.  Let's move forward with this, because the
optimization being proposed is clearly very useful.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: move PartitionBoundInfo creation code

2018-11-13 Thread Alvaro Herrera
"the context that was active when the function was called" is typically
expressed simply as "the current memory context".  Perhaps the whole
phrase can be reduced to "The object returned by this function is wholly
allocated in the current memory context"?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: proposal: simple query profile and tracing API

2018-11-13 Thread Pavel Stehule
út 13. 11. 2018 v 13:12 odesílatel legrand legrand <
legrand_legr...@hotmail.com> napsal:

> Hello Pavel,
>
> What about using wait events and a trigger on pg_stat_activity ?
>

pg_stat_activity does not necessarily show fresh data, and polling
pg_stat_activity can be too expensive for fast queries


> just :
>
> * create a functions to get current query signature (queryid) for a pid
>
> (not the top_level_query given for pl/pgsql blocks or triggers but the
> active one)
>
> * add some kind of active events to track planning  (in an extension with a
> planning hook)
>
> and populate some continuous views as proposed by pipelinedb (a very
> flexible solution).
>
> Yes, I know a trigger is not possible, and overhead of continuous views has
> not been verified,
> then some high frequency sampling on pg_stat_activity could help (I can
> provide examples for f_get_current_queryid(pid), active event for planning
> hook, continuous views)
>
> An other solution: a customized version of pgsentinel (for high frequency
> sampling):
>

I don't believe in the sampling method - I am talking about queries that take
less than 10 ms, where I would like to see 2-3 ms of planning time and 2-5 ms
of waits - that would mean sampling about every 2 ms, which is expensive

Regards

Pavel

>
> see
>
> https://www.postgresql-archive.org/Proposal-Add-accumulated-statistics-for-wait-event-td6030800.html#a6031384
>
> Regards
> PAscal
>
>
>
> --
> Sent from:
> http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
>
>


Re: Usage of pg_waldump

2018-11-13 Thread Peter Eisentraut
On 01/05/2018 18:39, David G. Johnston wrote:
> ​note: there is either an extra space between the two closing brackets
> or a missing space before startseg

I have committed a fix for this.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: using index or check in ALTER TABLE SET NOT NULL

2018-11-13 Thread Dmitry Dolgov
> On Sun, 4 Nov 2018 at 19:03, Sergei Kornilov  wrote:
>
> > If not properly cataloguing NOT NULL constraints would be fixed, can it
> > potentially conflict with the current patch or not?
> We already doing same stuff for "alter table attach partition" and in this 
> patch i use exactly this routine. If proper cataloguing would conflict with 
> my patch - it would conflict with "attach partition" validation too.
> I think proper cataloguing can be implemented without conflict with proposed 
> feature.

Yes, indeed, this patch relies on the PartConstraintImpliedByRelConstraint.
Then maybe it makes sense to go with the solution, proposed in this thread,
while leaving open the possibility of having "SET NOT NULL NOT VALID"? From
the functionality point of view it definitely would be beneficial. Any other
opinions?



Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

2018-11-13 Thread Jürgen Strobel
On 2018-11-13 00:57, Tom Lane wrote:
> =?UTF-8?Q?J=c3=bcrgen_Strobel?=  writes:
>> On 2018-11-12 17:33, Tom Lane wrote:
>>> I'm not entirely convinced that I buy that argument, especially not in
>>> a case like this where it introduces logical inconsistencies where there
>>> otherwise wouldn't be any.
> 
>> I think I missed something, what are the *introduced* logical problems?
> 
> What to do if a partition introduces a default value that would force
> re-routing to some other partition.

I would claim this is a problem which already exists with the current
behavior:

a) If a tuple is inserted through the parent it's ignored and useless.
b) If a tuple is inserted through the partition it can lead to
inconsistency and a constraint violation (possibly rectifiable by a
re-route which I think we all agree is out of the question).

>> Apart from implementation issues the only logical problems I see are if
>> you allow to change defaults of the partition key columns, and these are
>> problematic (nonsensical really) in either case.
> 
> Just claiming that it's nonsensical doesn't fix the problem.  Also, it
> is neither problematic nor nonsensical for the root to provide defaults
> for partition key columns.  So if we go this route, we are giving up
> useful behavior (key-column defaults on the root) for useless behavior
> (key-column defaults on the partitions).

Yes, I strongly believe defaults on the key columns in partitions should
be disallowed, that's why I called them nonsensical in both cases, see
above a) and b) for why this applies even to current behavior.

Of course disallowing all DEFAULT clauses on partitions is the easiest
way out, and I'd rather see this than inserts through partitions being
more than a pure optimization.

Best regards,
Jürgen



Re: Race condition in WaitForBackgroundWorkerStartup

2018-11-13 Thread Amit Kapila
On Mon, Nov 12, 2018 at 11:55 PM Jeremy Finzel  wrote:
>
> I believe I found a race condition in WaitForBackgroundWorkerStartup in the 
> case where it encounters an ERROR during startup.  I found that depending on 
> the speed of the system, it will unreliably return either status BGWH_STOPPED 
> or BGWH_STARTED.  But I can reliably reproduce getting BGWH_STOPPED by 
> tweaking the worker_spi.c test module.
>

Yeah, I think it is possible that you get different values in such
cases because we consider worker status as started after we have
launched the worker.  Now, if we get the error in the worker
initialization, then the user can get any of those values.  I think
this is what is happening in your case where you are saying "ERROR
during startup".
Am I missing something?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: proposal: simple query profile and tracing API

2018-11-13 Thread legrand legrand
Hello Pavel,

What about using wait events and a trigger on pg_stat_activity ?

just :

* create a function to get the current query signature (queryid) for a pid

(not the top_level_query given for pl/pgsql blocks or triggers but the
active one)

* add some kind of active events to track planning  (in an extension with a
planning hook)

and populate some continuous views as proposed by pipelinedb (a very
flexible solution).

Yes, I know a trigger is not possible, and overhead of continuous views has
not been verified,
then some high frequency sampling on pg_stat_activity could help (I can
provide examples for f_get_current_queryid(pid), active event for planning
hook, continuous views)

Another solution: a customized version of pgsentinel (for high frequency
sampling):

see
https://www.postgresql-archive.org/Proposal-Add-accumulated-statistics-for-wait-event-td6030800.html#a6031384

Regards
PAscal



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



RE: Copy data to DSA area

2018-11-13 Thread Ideriha, Takeshi

From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
>
>On Tue, Nov 13, 2018 at 9:45 AM Robert Haas  wrote:
>> On Thu, Nov 8, 2018 at 9:05 PM Thomas Munro
>>  wrote:
>> > * I had some ideas about some kind of "allocation rollback" interface:
>> > you begin an "allocation transaction", allocate a bunch of stuff
>> > (perhaps indirectly, by calling some API that makes query plans or
>> > whatever and is totally unaware of this stuff).  Then if there is an
>> > error, whatever was allocated so far is freed in the usual cleanup
>> > paths by a rollback that happens via the resource manager machinery.
>> > If you commit, then the allocation becomes permanent.  Then you only
>> > commit stuff that you promise not to leak (perhaps stuff that has
>> > been added to a very carefully managed cluster-wide plan cache).  I
>> > am not sure of the details, and this might be crazy...
>>
>> Hmm, my first thought was that you were just reinventing memory
>> contexts, but it's really not quite the same thing, because you want
>> the allocations to end up owned by a long-lived context when you
>> succeed but a transient context when you fail.  Still, if it weren't
>> for the fact that the memory context interface is hostile to dynamic
>> shared memory's map-this-anywhere semantics, I suspect we'd try to
>> find a way to make memory contexts fit the bill, maybe by reparenting
>> contexts or even individual allocations.  You could imagine using the
>> sorts of union algorithms that are described in
>> https://en.wikipedia.org/wiki/Disjoint-set_data_structure to get very
>> low asymptotic complexity here.
>
>Yeah, I was imagining something that still works with MemoryContexts.
>Interesting idea re: unions.  I haven't got as far as thinking about how you'd 
>actually
>make that work.  But I think we're both describing the same general kind of
>semantics; you need to be able to build stuff with automatic clean-up on 
>non-local exit,
>but also keep that stuff for longer once you decide you're ready.
>
>Anyway, avoiding all the hard questions about new kinds of foot gun for now, 
>here is a
>stupid hack that shows a DSA area inside the traditional fixed-address shared 
>memory
>region, wrapped in a custom (and somewhat defective for now) MemoryContext.  It
>shows a "List"
>being manipulated by standard PostgreSQL list code that is entirely unaware 
>that it is
>working in shared memory:
>
>postgres=# call hoge_list(3);
>NOTICE:  Contents of list:
>NOTICE:   1
>NOTICE:   2
>NOTICE:   3
>CALL
>
>You can call that procedure from various different backends to add and remove
>integers from a List.  The point is that it could just as easily be a query 
>plan or any
>other palloc-based stuff in our tree, and the pointers work in any backend.

Thanks for the patch. I know it's a rough sketch but basically that's what I 
want to do. 
I tried it and it works well.

Regards,
Takeshi Ideriha


Re: doc fix for pg_stat_activity.backend_type

2018-11-13 Thread Amit Kapila
On Tue, Nov 13, 2018 at 3:37 PM John Naylor  wrote:
>
> On 11/13/18, Amit Kapila  wrote:
> >
> > Don't you need to remove background worker?
>
> It's handled in pgstat_get_backend_desc(), so I assumed not. If that's
> just a place holder, then it's probably better left out, as in the
> attached.
>

I don't think 'background worker' can be displayed as backend_type.
Do you see any way it can be displayed as backend_type?

> > +  In addition, extensions may have additional types.
> >
> > How about: "In addition, background workers registered by extensions
> > may have additional types."?
>
> Sounds good to me -- I've included this language.
>

LGTM.  I will wait for a day or so, if nobody has any comments, I will
push your patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



[PATCH] Pass COPT and PROFILE to CXXFLAGS as well

2018-11-13 Thread Christoph Berg
While working on making extension module builds reproducible, I
noticed that extra flags passed via COPT (notably -ffile-prefix-map)
do not get added to CXXFLAGS.

This causes postgresql-hll to still store the full build path in the
built .so file:
https://tests.reproducible-builds.org/debian/rb-pkg/unstable/amd64/diffoscope-results/postgresql-hll.html
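
For example, this is the kind of invocation where the flag gets injected
(the exact flag value is illustrative); with the patch below it would reach
C++ compilations as well:

    make COPT="-ffile-prefix-map=$PWD=."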

Christoph
-- 
Senior Berater, Tel.: +49 2166 9901 187
credativ GmbH, HRB Mönchengladbach 12080, USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Unser Umgang mit personenbezogenen Daten unterliegt
folgenden Bestimmungen: https://www.credativ.de/datenschutz
>From 1c9a12d24169edb5343df28036adbf1d09a9ab2c Mon Sep 17 00:00:00 2001
From: Christoph Berg 
Date: Tue, 13 Nov 2018 11:35:26 +0100
Subject: [PATCH] Pass COPT and PROFILE to CXXFLAGS as well

The existing machinery for extending CFLAGS and LDFLAGS via COPT and
PROFILE neglected to extend CXXFLAGS as well, causing third party
extensions written in C++ not to get the extra flags.
---
 src/Makefile.global.in | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 956fd274cd..5e7eb550cc 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -671,11 +671,13 @@ endif
 #
 ifdef COPT
    CFLAGS += $(COPT)
+   CXXFLAGS += $(COPT)
    LDFLAGS += $(COPT)
 endif
 
 ifdef PROFILE
    CFLAGS += $(PROFILE)
+   CXXFLAGS += $(PROFILE)
    LDFLAGS += $(PROFILE)
 endif
 
-- 
2.19.1



proposal: simple query profile and tracing API

2018-11-13 Thread Pavel Stehule
Hi

We have a good API and tools for monitoring slow queries, which is very good.

But I miss monitoring of fast queries, which are usually the large majority
and where a relatively small slowdown can produce unhappy latencies in the
user interface. Moreover, slowdowns there can reveal some issues of database
health - bloated tables, indexes, overloading, ..

Because these queries are usually fast, the proposed interface should not add
any significant overhead, and it should not be too complex, because simple
things are just better.

My idea is to collect a few metrics for every query in local memory - when
query tracing is enabled. Now I am thinking mainly about:

* session start time
* transaction start time
* query start time
* query signature
* planning interval
* lock interval
* execution interval
* finish time
* query status
.. maybe more

These metrics can be stored in local memory, and I think collecting these
numbers should be pretty fast. Some of the mentioned metrics can be taken
now, but more than one hook would need to be involved.

When a query finishes, a new hook can be executed with access to the
mentioned metrics. The hook should be evaluated under the working
transaction, or with its own transaction if the previous query failed. This
API should work for failed queries, cancelled queries, and queries cancelled
by timeout too.
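
To illustrate, a minimal sketch of the per-query part using only the existing
executor hooks (this captures just the execution interval; the planning and
lock intervals, and the "query finished" hook itself, are exactly what is
missing today, so everything beyond the hook names is illustrative):

#include "postgres.h"
#include "fmgr.h"
#include "executor/executor.h"
#include "portability/instr_time.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

static ExecutorStart_hook_type prev_ExecutorStart = NULL;
static ExecutorEnd_hook_type prev_ExecutorEnd = NULL;
static instr_time exec_start;	/* ignoring nested statements for brevity */

static void
trace_ExecutorStart(QueryDesc *queryDesc, int eflags)
{
	INSTR_TIME_SET_CURRENT(exec_start);
	if (prev_ExecutorStart)
		prev_ExecutorStart(queryDesc, eflags);
	else
		standard_ExecutorStart(queryDesc, eflags);
}

static void
trace_ExecutorEnd(QueryDesc *queryDesc)
{
	instr_time	elapsed;

	INSTR_TIME_SET_CURRENT(elapsed);
	INSTR_TIME_SUBTRACT(elapsed, exec_start);

	/* here the proposed "query finished" hook would hand over the metrics */
	elog(DEBUG1, "execution interval: %.3f ms",
		 INSTR_TIME_GET_MILLISEC(elapsed));

	if (prev_ExecutorEnd)
		prev_ExecutorEnd(queryDesc);
	else
		standard_ExecutorEnd(queryDesc);
}

void
_PG_init(void)
{
	prev_ExecutorStart = ExecutorStart_hook;
	ExecutorStart_hook = trace_ExecutorStart;
	prev_ExecutorEnd = ExecutorEnd_hook;
	ExecutorEnd_hook = trace_ExecutorEnd;
}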

Maybe similar hooks can run after a transaction and after a session, where
some metrics can be processed before they are replaced by a new transaction
or lost on disconnect.

What do you think about this proposal?

Regards

Pavel


Re: doc fix for pg_stat_activity.backend_type

2018-11-13 Thread John Naylor
On 11/13/18, Amit Kapila  wrote:
> On Tue, Nov 13, 2018 at 12:04 PM John Naylor  wrote:
>>
>> On 11/13/18, Amit Kapila  wrote:
>> > On Tue, Nov 13, 2018 at 5:38 AM Michael Paquier 
>> > wrote:
>> >>
>> >> On Mon, Nov 12, 2018 at 09:42:45PM +0700, John Naylor wrote:
>> >> > Looks like it. A quick search revealed "parallel worker" and
>> >> > "logical
>> >> > replication worker". src/test/modules/ also show "test_shm_mq" and
>> >> > "worker_spi", but it seems those don't need to be publicly
>> >> > documented.
>> >> > If that sounds right I'll update the patch to include the first two.
>> >>
>> >> Just wondering: do we actually need to include in the docs this list
>> >> at
>> >> all?  This is a recipe to forget its update each time a new backend
>> >> type
>> >> is added.
>> >>
>> >
>> > Sure, but how will we justify documenting (autovacuum launcher and
>> > autovacuum worker) and not (logical replication launcher and logical
>> > replication worker)?  I think we can document the type of workers that
>> > are part of core-server functionality.  We can make some generic
>> > statement on the workers that can be launched by extensions.
>>
>> How about something like the attached?
>>
>
> Don't you need to remove background worker?

It's handled in pgstat_get_backend_desc(), so I assumed not. If that's
just a place holder, then it's probably better left out, as in the
attached.

> +  In addition, extensions may have additional types.
>
> How about: "In addition, background workers registered by extensions
> may have additional types."?

Sounds good to me -- I've included this language.

-John Naylor
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index add71458e2..9600ef0d64 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -804,10 +804,13 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  text
  Type of current backend. Possible types are
   autovacuum launcher, autovacuum worker,
-  background worker, background writer,
+  logical replication launcher,
+  logical replication worker,
+  parallel worker, background writer,
   client backend, checkpointer,
   startup, walreceiver,
   walsender and walwriter.
+  In addition, background workers registered by extensions may have additional types.
  
 



RE: Copy data to DSA area

2018-11-13 Thread Ideriha, Takeshi
Thank you for the comment.

From: Thomas Munro [mailto:thomas.mu...@enterprisedb.com]
>> I'm thinking to go with plan 1. No need to think about address
>> translation seems tempting. Plan 2 (as well as plan 3) looks a big project.
>
>The existing function dsa_create_in_place() interface was intended to support 
>that,
>but has never been used in that way so I'm not sure what extra problems will 
>come up.
>Here are some assorted thoughts:
>
>* You can prevent a DSA area from creating extra DSM segments, so that it is
>constrained to stay entirely in the space you give it, by calling 
>dsa_set_size_limit(area,
>size) using the same size that you gave to dsa_create_in_place(); now you have 
>a
>DSA area that manages a single fixed-sized chunk of memory that you gave it, 
>in your
>case inside the traditional shared memory segment (but it could be anywhere,
>including inside a DSM segment or another DSA area!)
Yeah, I will use it.

>* You can probably write a MemoryContext wrapper for it, if it has only one 
>segment
>that is in the traditional shared memory segment.
>You would need to do very simple kind of address translation: the result from 
>palloc()
>needs to be base + dsa_allocate()'s result, and the argument to pfree() needs 
>to be
>subtracted from base when
>dsa_free() is called.  That is a version of your idea C that should work AFAIK.

I hadn't noticed that if only one segment is used, dsa_get_address() is not
needed and simple math is enough.
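
Just to confirm the idea, a minimal sketch of that translation (assuming a
single in-place segment; "area" and "base" are placeholders for wherever the
wrapper keeps them, and this needs "utils/dsa.h"):

static dsa_area *area;		/* created with dsa_create_in_place() */
static char *base;			/* address of that in-place chunk */

static void *
shm_context_alloc(Size size)
{
	dsa_pointer dp = dsa_allocate(area, size);

	return base + dp;		/* offset-like value -> usable native pointer */
}

static void
shm_context_free(void *p)
{
	dsa_free(area, (dsa_pointer) ((char *) p - base));
}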

>* Once you have that working, you now have a new kind of resource management
>problem on your hands: memory leaks will be cluster-wide and cluster-life-time!
>That's hard, because the goal is to be able to use arbitrary code in the tree 
>that deals
>with plans etc, but that code all assumes that it can "throw" (elog()) on 
>errors.
>PostgreSQL C is generally "garbage collected" (in a way), but in this sketch, 
>that
>doesn't work anymore: this area *never* goes out of scope and gets cleaned up.
>Generally, languages with exceptions either need garbage collection or scoped
>destructors to clean up the mess, but in this sketch we don't have that 
>anymore...
>much like allocating stuff in TopMemoryContext, except worse because it 
>doesn't go
>away when one backend exits.
>
>* I had some ideas about some kind of "allocation rollback" interface:
>you begin an "allocation transaction", allocate a bunch of stuff (perhaps 
>indirectly, by
>calling some API that makes query plans or whatever and is totally unaware of 
>this
>stuff).  Then if there is an error, whatever was allocated so far is freed in 
>the usual
>cleanup paths by a rollback that happens via the resource manager machinery.
>If you commit, then the allocation becomes permanent.  Then you only commit 
>stuff
>that you promise not to leak (perhaps stuff that has been added to a very 
>carefully
>managed cluster-wide plan cache).  I am not sure of the details, and this 
>might be
>crazy...


Let me check my understanding.  The situation you are talking about is the
following: data structures A and B will be allocated on DSA, and A has a
pointer to B.  Another structure X is already allocated in some other area
(it might be on the local heap) and has a pointer field X->a, which will be
linked to A.

 old_context = MemoryContextSwitchTo(dsa_memory_context);
 A = palloc();
 B = palloc();
 A->b = B; 
 X->a = A;
 MemoryContextSwitchTo(old_context);
 
If an error happens anywhere in this flow, the palloc'd data (A & B) should
be freed and the pointer to the freed data (X->a) should be restored to its
original value.  So handling these cases introduces a "transaction" API like
begin_allocate() and end_allocate().

I'm thinking begin_allocate() starts keeping a record of palloc'd data until
end_allocate() is done.  If an error occurs, we just pfree() that data.
However, to roll back the pointers we also need to record each pointer's old
value.  This would introduce a new API like
"points_to_pallocd_data(pointer, new_data)" to remember the old value and do
`X->a = A`.
To implement this I still need to think further about how to fit into or
extend the existing MemoryContext machinery.

Regards,
Takeshi Ideriha


Re: BUG #15449: file_fdw using program cause exit code error when using LIMIT

2018-11-13 Thread Etsuro Fujita

(2018/11/12 20:41), Etsuro Fujita wrote:

(2018/11/12 18:52), Kyotaro HORIGUCHI wrote:

I read Thomas's messages as "TTIO/TTOU are not
needed for our business and we don't have a reason to restore them
before calling external programs other than just to plaster over
seeming consistency."  And I agree with the analysis, and I agree
with you on the point that it doesn't need consideration just now.


OK


Attached is an updated version of Tom's POC patch.  Here are the changes:

* I modified his patch so that the called program inherits SIG_DFL for 
SIGUSR2 as well, as discussed upthread.


* I think it's better to ignore the SIGPIPE failure in
ClosePipeToProgram if we were in a COPY FROM PROGRAM that was allowed to
terminate early, and to keep the behavior as-is otherwise.  If we ignored
that failure unconditionally in that function, e.g. COPY TO PROGRAM would
fail to get a (better) error message in CopySendEndOfRow or EndCopy when
the invoked program was terminated on SIGPIPE, as discussed before [1].
To do so, I added a new argument to BeginCopyFrom to specify
whether COPY FROM PROGRAM can terminate early or not.


Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/5BE18409.2070004%40lab.ntt.co.jp
*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***
*** 678,683  fileBeginForeignScan(ForeignScanState *node, int eflags)
--- 678,684 
  		   filename,
  		   is_program,
  		   NULL,
+ 		   true,
  		   NIL,
  		   options);
  
***
*** 754,759  fileReScanForeignScan(ForeignScanState *node)
--- 755,761 
  	festate->filename,
  	festate->is_program,
  	NULL,
+ 	true,
  	NIL,
  	festate->options);
  }
***
*** 1117,1124  file_acquire_sample_rows(Relation onerel, int elevel,
  	/*
  	 * Create CopyState from FDW options.
  	 */
! 	cstate = BeginCopyFrom(NULL, onerel, filename, is_program, NULL, NIL,
! 		   options);
  
  	/*
  	 * Use per-tuple memory context to prevent leak of memory used to read
--- 1119,1126 
  	/*
  	 * Create CopyState from FDW options.
  	 */
! 	cstate = BeginCopyFrom(NULL, onerel, filename, is_program, NULL, false,
! 		   NIL, options);
  
  	/*
  	 * Use per-tuple memory context to prevent leak of memory used to read
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
***
*** 119,124  typedef struct CopyStateData
--- 119,125 
  	int			file_encoding;	/* file or remote side's character encoding */
  	bool		need_transcoding;	/* file encoding diff from server? */
  	bool		encoding_embeds_ascii;	/* ASCII can be non-first byte? */
+ 	bool		allow_partial;	/* true if partial execution is allowed */
  
  	/* parameters from the COPY command */
  	Relation	rel;			/* relation to copy to or from */
***
*** 998,1004  DoCopy(ParseState *pstate, const CopyStmt *stmt,
  		PreventCommandIfParallelMode("COPY FROM");
  
  		cstate = BeginCopyFrom(pstate, rel, stmt->filename, stmt->is_program,
! 			   NULL, stmt->attlist, stmt->options);
  		*processed = CopyFrom(cstate);	/* copy from file to database */
  		EndCopyFrom(cstate);
  	}
--- 999,1005 
  		PreventCommandIfParallelMode("COPY FROM");
  
  		cstate = BeginCopyFrom(pstate, rel, stmt->filename, stmt->is_program,
! 			   NULL, false, stmt->attlist, stmt->options);
  		*processed = CopyFrom(cstate);	/* copy from file to database */
  		EndCopyFrom(cstate);
  	}
***
*** 1727,1737  ClosePipeToProgram(CopyState cstate)
--- 1728,1749 
  (errcode_for_file_access(),
   errmsg("could not close pipe to external command: %m")));
  	else if (pclose_rc != 0)
+ 	{
+ 		/*
+ 		 * If we were in a COPY FROM PROGRAM that was allowed to terminate
+ 		 * early and the called program terminated on SIGPIPE, assume it's OK;
+ 		 * we must have chosen to stop reading its output early.
+ 		 */
+ 		if (cstate->allow_partial &&
+ 			WIFSIGNALED(pclose_rc) && WTERMSIG(pclose_rc) == SIGPIPE)
+ 			return;
+ 
  		ereport(ERROR,
  (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
   errmsg("program \"%s\" failed",
  		cstate->filename),
 errdetail_internal("%s", wait_result_to_str(pclose_rc))));
+ 	}
  }
  
  /*
***
*** 3184,3189  BeginCopyFrom(ParseState *pstate,
--- 3196,3202 
  			  const char *filename,
  			  bool is_program,
  			  copy_data_source_cb data_source_cb,
+ 			  bool allow_partial,
  			  List *attnamelist,
  			  List *options)
  {
***
*** 3301,3306  BeginCopyFrom(ParseState *pstate,
--- 3314,3320 
  	cstate->volatile_defexprs = volatile_defexprs;
  	cstate->num_defaults = num_defaults;
  	cstate->is_program = is_program;
+ 	cstate->allow_partial = allow_partial;
  
  	if (data_source_cb)
  	{
*** a/src/backend/replication/logical/tablesync.c
--- b/src/backend/replication/logical/tablesync.c
***
*** 799,805  copy_table(Relation rel)
  

Re: doc fix for pg_stat_activity.backend_type

2018-11-13 Thread Amit Kapila
On Tue, Nov 13, 2018 at 12:04 PM John Naylor  wrote:
>
> On 11/13/18, Amit Kapila  wrote:
> > On Tue, Nov 13, 2018 at 5:38 AM Michael Paquier 
> > wrote:
> >>
> >> On Mon, Nov 12, 2018 at 09:42:45PM +0700, John Naylor wrote:
> >> > Looks like it. A quick search revealed "parallel worker" and "logical
> >> > replication worker". src/test/modules/ also show "test_shm_mq" and
> >> > "worker_spi", but it seems those don't need to be publicly documented.
> >> > If that sounds right I'll update the patch to include the first two.
> >>
> >> Just wondering: do we actually need to include in the docs this list at
> >> all?  This is a recipe to forget its update each time a new backend type
> >> is added.
> >>
> >
> > Sure, but how will we justify documenting (autovacuum launcher and
> > autovacuum worker) and not (logical replication launcher and logical
> > replication worker)?  I think we can document the type of workers that
> > are part of core-server functionality.  We can make some generic
> > statement on the workers that can be launched by extensions.
>
> How about something like the attached?
>

Don't you need to remove background worker?

+  In addition, extensions may have additional types.

How about: "In addition, background workers registered by extensions
may have additional types."?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Control your disk usage in PG: Introduction to Disk Quota Extension

2018-11-13 Thread Hubert Zhang
Hi all,

We have implemented a disk quota feature for Postgresql as an extension (link:
https://github.com/greenplum-db/diskquota).
If you are interested, try it out to limit the amount of disk space that
a schema or a role can use in Postgresql.
Any feedback or questions are appreciated.

Overview

Diskquota is an extension that provides disk usage enforcement for database
objects in Postgresql. Currently it supports setting a quota limit on schemas
and roles in a given database, limiting the amount of disk space that a
schema or a role can use.

This project is inspired by Heikki's pg_quota project (link:
https://github.com/hlinnaka/pg_quota) and enhances it to support different
kinds of DDL and DML which may change the disk usage of database objects.

Diskquota is a soft limit on disk usage. There is some delay in detecting the
schemas or roles whose quota limit has been exceeded. Here 'soft limit'
supports two kinds of enforcement: a query loading data into an out-of-quota
schema/role will be forbidden before the query starts running, and a query
loading data into a schema/role that still has room will be cancelled when
the quota limit is reached dynamically while the query is running.

Design

The diskquota extension is based on the background worker framework in
Postgresql. There are two kinds of background workers: the diskquota launcher
and the diskquota workers.

There is only one launcher process per database cluster (i.e. one launcher
per postmaster). The launcher process is responsible for managing the worker
processes: it calls RegisterDynamicBackgroundWorker() to create new workers
and keeps their handles, and calls TerminateBackgroundWorker() to terminate
workers which are disabled when the DBA modifies diskquota.monitor_databases.

There are many worker processes, one for each database listed in
diskquota.monitor_databases. Currently we support monitoring at most 10
databases at the same time. Worker processes are responsible for monitoring
the disk usage of schemas and roles in the target database and for doing
quota enforcement. Each worker periodically (the interval can be set via
diskquota.naptime) recalculates the table size of active tables and updates
the corresponding schema's or owner's disk usage, then compares it with the
quota limit for those schemas or roles. If the limit is exceeded, it puts the
corresponding schemas or roles into a blacklist in shared memory. Schemas or
roles in the blacklist are used for query enforcement, to cancel queries
which plan to load data into these schemas or roles.

Active table

Active tables are the tables whose size may have changed during the last
quota check interval. We use hooks in smgrcreate(), smgrextend() and
smgrtruncate() to detect active tables and store them (currently their
relfilenodes) in shared memory. The diskquota worker process periodically
consumes the active tables from shared memory, converts each relfilenode to a
relation oid, and calculates the table size by calling
pg_total_relation_size(), which sums the size of the table (including: base,
vm, fsm, toast and index).

Enforcement

Enforcement is implemented as hooks. There are two kinds of enforcement
hooks: enforcement before the query runs and enforcement while the query is
running. The 'before query' one is implemented at ExecutorCheckPerms_hook in
function ExecCheckRTPerms(). The 'during query' one is implemented at
BufferExtendCheckPerms_hook in function ReadBufferExtended(). Note that the
implementation of BufferExtendCheckPerms_hook first checks whether the
function requests a new block, and skips the check directly if not.
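
For illustration, the 'before query' part looks roughly like this
(quota_blacklist_contains() stands in for the shared-memory blacklist lookup;
this is a simplified sketch, not the extension's exact code):

#include "postgres.h"
#include "executor/executor.h"
#include "nodes/parsenodes.h"

/* hypothetical: lookup in the shared-memory blacklist kept by the workers */
extern bool quota_blacklist_contains(Oid relid);

static bool
dq_ExecutorCheckPerms(List *rangeTable, bool ereport_on_violation)
{
	ListCell   *l;

	foreach(l, rangeTable)
	{
		RangeTblEntry *rte = (RangeTblEntry *) lfirst(l);

		/* only plain relations that the query intends to insert into */
		if (rte->rtekind != RTE_RELATION ||
			(rte->requiredPerms & ACL_INSERT) == 0)
			continue;

		if (quota_blacklist_contains(rte->relid))
		{
			if (ereport_on_violation)
				ereport(ERROR,
						(errcode(ERRCODE_DISK_FULL),
						 errmsg("schema or role disk quota exceeded")));
			return false;
		}
	}
	return true;
}

/* in _PG_init():  ExecutorCheckPerms_hook = dq_ExecutorCheckPerms; */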

Quota setting store

The quota limit of a schema or a role is stored in the table 'quota_config'
in the 'diskquota' schema of the monitored database, so each database stores
and manages its own disk quota configuration. Note that although a role is a
cluster-level object, we limit the disk quota of a role to be database
specific. That is to say, a role may have different quota limits in different
databases, and its disk usage is isolated between databases.
Install

   1. Add hook functions to Postgres by applying the patch. This is required
   since disk quota needs to add some new hook functions to the postgres
   core. This step can be skipped once the patch is merged into postgres in
   the future.

# install patch into postgres_src and rebuild postgres.
cd postgres_src;
git apply $diskquota_src/patch/pg_hooks.patch;
make;
make install;


   2. Compile and install disk quota.

cd $diskquota_src;
make;
make install;


   3. Configure postgresql.conf

# enable diskquota in preload library.
shared_preload_libraries = 'diskquota'
# set monitored databases
diskquota.monitor_databases = 'postgres'
# set naptime (second) to refresh the disk quota stats periodically

Re: Undo logs

2018-11-13 Thread Amit Kapila
On Wed, Oct 17, 2018 at 3:27 PM Amit Kapila  wrote:
>
> On Mon, Oct 15, 2018 at 6:09 PM Amit Kapila  wrote:
> >
> > On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro
> >  wrote:
> > > I have also pushed a new WIP version of the lower level undo log
> > > storage layer patch set to a public branch[1].  I'll leave the earlier
> > > branch[2] there because the record-level patch posted by Dilip depends
> > > on it for now.
> > >
> >
>
> Till now, I have mainly reviewed undo log allocation part.  This is a
> big patch and can take much more time to complete the review.  I will
> review the other parts of the patch later.
>

I think I see another issue with this patch.  Basically, during
extend_undo_log, there is an assumption that no one can update
log->meta.end concurrently, but that is not true, as it can be updated by
UndoLogDiscard, which can lead to an assertion failure in extend_undo_log.

+static void
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
{
..
/*
+ * Create all the segments needed to increase 'end' to the requested
+ * size.  This is quite expensive, so we will try to avoid it completely
+ * by renaming files into place in UndoLogDiscard instead.
+ */
+ end = log->meta.end;
+ while (end < new_end)
+ {
+ allocate_empty_undo_segment(logno, log->meta.tablespace, end);
+ end += UndoLogSegmentSize;
+ }
..
+ Assert(end == new_end);
..
/*
+ * We didn't need to acquire the mutex to read 'end' above because only
+ * we write to it.  But we need the mutex to update it, because the
+ * checkpointer might read it concurrently.
+ *
+ * XXX It's possible for meta.end to be higher already during
+ * recovery, because of the timing of a checkpoint; in that case we did
+ * nothing above and we shouldn't update shmem here.  That interaction
+ * needs more analysis.
+ */
+ LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+ if (log->meta.end < end)
+ log->meta.end = end;
+ LWLockRelease(&log->mutex);
..
}

Assume that, before we read log->meta.end in the above code, the discard
process concurrently discards the undo and moves the end pointer to a later
location; then the above assertion will fail.  Now, if the discard happens
just after we read log->meta.end and before we do the other stuff in this
function, then it will crash in recovery.

Can't we just remove this Assert?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: PostgreSQL Limits and lack of documentation about them.

2018-11-13 Thread David Rowley
On 13 November 2018 at 19:46, John Naylor  wrote:
> On 11/8/18, David Rowley  wrote:
>> On 8 November 2018 at 22:46, Peter Eisentraut
>>  wrote:
>>> Could you adjust this to use fewer capital letters, unless they start
>>> sentences or similar?
>>
>> Yeah. Changed in the attached.
>
> Looks good to me. Since there have been no new suggestions for a few
> days, I'll mark it ready for committer.

Thanks for your review.  I don't think these initially need to include
100% of the limits. If we stumble on things later that seem worth
including, we'll have a place to write them down.

The only other thing that sprang to mind was the maximum number of tables per
query.  This is currently limited to 64999, not including double counting of
partitioned tables and inheritance parents, but I kinda think that if we feel
the need to document it, then we might as well just raise the limit.  It
seems a bit arbitrarily set at the moment. I don't see
any reason it couldn't be higher. Although, if it was too high we'd
start hitting things like palloc() size limits on simple_rte_array.
I'm inclined to not bother mentioning it.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services