Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-24 Thread John Naylor
On Wed, Apr 24, 2019 at 1:58 PM Amit Kapila  wrote:

> The two improvements in this code which are discussed in this thread
> and can be done independently of this patch are:
> a. use one bit to represent each block in the map.  This gives us the
> flexibility to use the map with a different threshold for some other
> storage.
> b. improve the usage of smgrexists by checking smgr_fsm_nblocks.
>
> John, can you implement these two improvements either on HEAD or on
> top of this patch?

I've done B in the attached. There is a more recent idea of using the
byte to store the actual free space in the same format as the FSM.
That might be v13 material, but in any case, I'll hold off on A for
now.
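
For illustration, a minimal sketch of what (a) might look like. All of the
type and function names below are invented, and the 32-block cap merely
follows the uint32 idea Andres floated downthread; this is a sketch, not
the committed implementation:

#include "postgres.h"
#include "storage/block.h"

#define LOCAL_MAP_MAX_BLOCKS	32

typedef struct LocalBlockMap
{
    uint32		map;		/* bit N set => block N may have free space */
    BlockNumber	nblocks;	/* number of blocks currently covered */
} LocalBlockMap;

static inline void
local_map_set_avail(LocalBlockMap *lm, BlockNumber blkno)
{
    Assert(blkno < LOCAL_MAP_MAX_BLOCKS);
    lm->map |= ((uint32) 1) << blkno;
}

static inline void
local_map_clear_avail(LocalBlockMap *lm, BlockNumber blkno)
{
    Assert(blkno < LOCAL_MAP_MAX_BLOCKS);
    lm->map &= ~(((uint32) 1) << blkno);
}

static inline bool
local_map_get_avail(const LocalBlockMap *lm, BlockNumber blkno)
{
    Assert(blkno < LOCAL_MAP_MAX_BLOCKS);
    return (lm->map & (((uint32) 1) << blkno)) != 0;
}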


-- 
John Naylor  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


optimize-smgrexists.patch
Description: Binary data


Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-24 Thread Amit Kapila
On Wed, Apr 24, 2019 at 9:49 PM Andres Freund  wrote:
> On 2019-04-24 11:28:32 +0530, Amit Kapila wrote:
> > On Tue, Apr 23, 2019 at 10:59 PM Andres Freund  wrote:
> > > > > On 2019-04-22 18:49:44 +0530, Amit Kapila wrote:
> > > > I think we should first try to see in this new scheme (a) when to set
> > > > the map, (b) when to clear it, (c) how to use it.  I have tried to
> > > > summarize my thoughts about it; let me know what you think about
> > > > the same.
> > > >
> > > > When to set the map.
> > > > At the beginning (the first time the relation is used in the backend),
> > > > the map will be clear.  When, for the first time in the backend, we find
> > > > that the FSM doesn't exist and the number of blocks is less than
> > > > HEAP_FSM_CREATION_THRESHOLD, we set the map for the total blocks that
> > > > exist at that time and mark all or alternate blocks as available.
> > >
> > > I think the alternate blocks scheme has to go. It's not defensible.
> > >
> >
> > Fair enough, I have changed it in the attached patch.  However, I
> > think we should test it once the patch is ready as we have seen a
> > small performance regression due to that.
>
> Sure, but that was because you re-scanned from scratch after every
> insertion, no?
>

Possible.

>
> > > And sure, leaving that aside we could store one byte per block
> >
> > Hmm, I think you mean to say one-bit per block, right?
>
> No, I meant byte.
>

Upthread you have said: "Hm, given realistic
HEAP_FSM_CREATION_THRESHOLD, and the fact that we really only need one
bit per relation, it seems like map should really be just a uint32
with one bit per page.   It'd allow different AMs to have different
numbers of dont-create-fsm thresholds without needing additional
memory (up to 32 blocks)."

I can understand the advantage of the one-bit-per-page suggestion, but now
you are suggesting one byte per page.  I am confused between those two
options.  Am I missing something?

> The normal FSM saves how much space there is for a
> block, but the current local fsm doesn't. That means pages are marked as
> unavailable even though other tuples would possibly fit.
>

Sure, in the regular FSM, vacuum can update the available space, but we
don't have such a provision for the local map unless we decide to keep
it in shared memory.
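
To make the one-byte-per-block idea concrete, here is a hedged sketch
(invented names, not code that exists in the tree) that quantizes free
space the way the on-disk FSM does, into 1/256-of-a-page categories, so
that RelationGetBufferForTuple() could keep it current as it fills pages:

#include "postgres.h"
#include "storage/block.h"

#define LOCAL_MAP_MAX_BLOCKS	32
#define LOCAL_MAP_CAT_STEP		(BLCKSZ / 256)

typedef struct LocalFreeSpaceMap
{
    uint8		avail[LOCAL_MAP_MAX_BLOCKS];	/* FSM-style category per block */
    BlockNumber	nblocks;
} LocalFreeSpaceMap;

/* convert bytes of free space to a 0..255 category, clamping at the top */
static inline uint8
local_map_space_to_cat(Size avail)
{
    Size		cat = avail / LOCAL_MAP_CAT_STEP;

    return (uint8) (cat > 255 ? 255 : cat);
}

/* record the free space observed on a block after inserting into it */
static inline void
local_map_record(LocalFreeSpaceMap *lm, BlockNumber blkno, Size avail)
{
    Assert(blkno < LOCAL_MAP_MAX_BLOCKS);
    lm->avail[blkno] = local_map_space_to_cat(avail);
}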

>
> > >  It's possible that'd come with
> > > some overhead - I've not thought sufficiently about it: I assume we'd
> > > still start out in each backend assuming each page is empty, and we'd
> > > then rely on RelationGetBufferForTuple() to update that. What I wonder
> > > is if we'd need to check if an on-disk FSM has been created every time
> > > the space on a page is reduced?  I think not, if we use invalidations to
> > > notify others of the existence of an on-disk FSM. There's a small race,
> > > but that seems ok.
>
> > Do you mean to say that vacuum or some backend should invalidate in
> > case it first time creates the FSM for a relation?
>
> Right.
>
>
> > I think it is quite possible that the vacuum takes time to trigger
> > such invalidation if there are fewer deletions.  And we won't be able
> > to use newly added page/s.
>
> I'm not sure I understand what you mean by that? If the backend that
> ends up creating the FSM - because it extended the relation beyond 4
> pages - issues an invalidation, the time till other backends pick that
> up should be minimal?
>

Consider the case where backend-1 starts inserting into a relation that
has just two pages, so we create a local map covering those two pages.
Now backend-2 extends the relation by one page; how and when will
backend-1 come to know about that?  One possibility is that once all the
pages present in backend-1's local map become invalid (used up), we
re-check the number of blocks and update the local map.

>
> > IIUC, you are suggesting to issue invalidations when the (a) vacuum
> > finds there is no FSM and page has more space now, (b) invalidation to
> > notify the existence of FSM.  It seems to me that issuing more
> > invalidations for the purpose of FSM can lead to an increased number
> > of relation builds in the overall system.  I think this can have an
> > impact when there are many small relations in the system which in some
> > scenarios might not be uncommon.
>
> If this becomes an issue I'd just create a separate type of
> invalidation, one that just signals that the FSM is being invalidated.
>

Oh, clever idea, but I guess that will take some work unless we already
do something similar elsewhere.  Anyway, we can look into it later.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [Patch] Check file type before calling AllocateFile() for files out of pg data directory to avoid potential issues (e.g. hang).

2019-04-24 Thread Paul Guo
On Wed, Apr 24, 2019 at 10:36 PM Tom Lane  wrote:

> Alvaro Herrera  writes:
> > On 2019-Apr-24, Paul Guo wrote:
> >> On Wed, Apr 24, 2019 at 12:49 PM Andres Freund 
> wrote:
> >>> This seems like a bad idea to me. IMO we want to support using a pipe
> >>> etc here. If the admin creates a fifo like this without attaching a
> >>> program it seems like it's their fault.
>
> >> Oh, I never knew about this application scenario before. So yes, for this, we
> >> need to keep the current code logic in the copy code.
>
> > But the pgstat.c patch seems reasonable to me.
>
> Nah, I don't buy that one either.  Nobody has any business creating any
> non-Postgres files in the stats directory ... and if somebody does want
> to stick a FIFO in there, perhaps for debug purposes, why should we stop
> them?
>

For the pgstat case, the files passed to AllocateFile() are actually temp
files which are soon renamed to other file names.  Users probably don't
want to set those up as FIFO files.  For developers, 'tail -f' might be
sufficient for debugging purposes.

  const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE :
                                    pgstat_stat_tmpname;

  fpout = AllocateFile(tmpfile, PG_BINARY_W);
  fwrite(..., fpout);
  rename(tmpfile, statfile);

I'm not sure whether those hardcoded temp filenames (not just those in
pgstat) are reused across a postgres restart.  If not, we should instead
call a libc function to create unique temp files and remove those
hardcoded temp filename variables; otherwise we might also want to ensure
they are regular files.
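
For what it's worth, a minimal sketch of the kind of check the patch is
getting at (a hypothetical helper, not the actual patch): verify with
lstat() that an existing path is a regular file before handing it to
AllocateFile(), so a stray FIFO cannot make the open block.

#include <sys/stat.h>
#include <stdbool.h>

static bool
path_is_regular_or_missing(const char *path)
{
    struct stat st;

    if (lstat(path, &st) != 0)
        return true;            /* ENOENT and friends: nothing there to block on */

    return S_ISREG(st.st_mode);
}

(A check like this is inherently racy between the lstat() and the open, so
it narrows the window rather than closing it.)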


> The case with COPY is a bit different, since there it's reasonable to be
> worried about collisions with other users' files --- but I agree with
> Andres that this change would eliminate too many valid use-cases.
>
> regards, tom lane
>


Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

2019-04-24 Thread Amit Langote
On 2019/04/25 8:27, Tom Lane wrote:
> Amit Langote  writes:
>> On 2019/04/24 7:03, Tom Lane wrote:
>>> ISTM we could work around the problem with the attached, which I think
>>> is a good change independently of anything else.
> 
>> Agreed.
> 
> After thinking about it more, it seems like a bad idea to put the check
> in CheckIndexCompatible; that could interfere with any other use of the
> function to match up potential child indexes.

CheckIndexCompatible() seems to be invented solely for the purposes of
post-ALTER-COLUMN-TYPE, which is what I get from the header comment of
that function:

 * This is tailored to the needs of ALTER TABLE ALTER TYPE, which recreates
 * any indexes that depended on a changing column from their pg_get_indexdef
 * or pg_get_constraintdef definitions.  We omit some of the sanity checks of
 * DefineIndex.  We assume that the old and new indexes have the same number
 * of columns and that if one has an expression column or predicate, both do.
 * Errors arising from the attribute list still apply.

Given that, it may have been better to keep the function local to
tablecmds.c to avoid potential misuse; I see no other callers of this
function beside TryReuseIndex(), which in turn is only used by
post-ALTER-COLUMN-TYPE processing.

> (I wonder why this test
> isn't the same as what DefineIndex uses to spot potential child indexes,
> BTW --- it uses a completely separate function CompareIndexInfo, which
> seems both wasteful and trouble waiting to happen.)

I don't think that CheckIndexCompatible's job is to spot child indexes
compatible with a given parent index; that's CompareIndexInfo's job.  The
latter was invented for the partition index DDL code, complete with
provisions to do mapping between parent and child attributes before
matching if their TupleDescs are different.

CheckIndexCompatible, as mentioned above, is to do the errands of
ATPostAlterTypeParse().

/*
 * CheckIndexCompatible
 *  Determine whether an existing index definition is compatible with a
 *  prospective index definition, such that the existing index storage
 *  could become the storage of the new index, avoiding a rebuild.

So, maybe a bigger (different?) charter than CompareIndexInfo's, or so I
think.  Although, they may be able to share the code.

> So I think we should test the relkind in TryReuseIndex, instead.

Which would work too.  Or we could just not call TryReuseIndex() if the
index in question is partitioned.

> I think it would be a good idea to *also* do what you suggested
> upthread and have DefineIndex clear the field when cloning an
> IndexStmt to create a child; no good could possibly come of
> passing that down when we intend to create a new index.

Note that oldNode is only set by TryReuseIndex() today.  Being careful in
DefineIndex might be a good idea anyway.

> In short, I think the right short-term fix is as attached (plus
> a new regression test case based on the submitted example).

Sounds good.

> Longer term, it's clearly bad that we fail to reuse child indexes
> in this scenario; the point about mangled comments is minor by
> comparison.

Agreed.

> I'm inclined to think that what we want to do is
> *not* recurse when creating the parent index, but to initially
> make it NOT VALID, and then do ALTER ATTACH PARTITION with each
> re-used child index.  This would successfully reproduce the
> previous state in case only some of the partitions have attached
> indexes, which I don't think works correctly right now.

Well, an index in the parent must be present in all partitions, so I don't
understand which case you're thinking of.

> BTW, I hadn't ever looked closely at what the index reuse code
> does, and now that I have, my heart just sinks.  I think that
> logic needs to be nuked from orbit.  RelationPreserveStorage was
> never meant to be abused in this way; I invite you to contemplate
> whether it's not a problem that it doesn't check the backend and
> nestLevel fields of PendingRelDelete entries before killing them.
> (In its originally-designed use for mapped rels at transaction end,
> that wasn't a problem, but I'm afraid that it may be one now.)
> 
> The right way to do this IMO would be something closer to the
> heap-swap logic in cluster.c, where we exchange the relfilenodes
> of two live indexes, rather than what is happening now.  Or for
> that matter, do we really need to delete the old indexes at all?

Yeah, we wouldn't need TryReuseIndex and subsequent
RelationPreserveStorage if we hadn't dropped the old indexes to begin
with.  Instead, in ATPostAlterTypeParse, check if the index after ALTER
TYPE is still compatible (CheckIndexCompatible) and if it is, don't add
a new AT_ReAddIndex command.  If it's not, *then* drop the index, and
recreate the index from scratch using an IndexStmt generated from the old
index definition.  I guess we can get rid of IndexStmt.oldNode too.

> None of that looks very practical for v12, let alone back-patching
> to v11, 

Re: Help to review the with X cursor option.

2019-04-24 Thread alex lock
On Thu, Apr 25, 2019 at 9:53 AM alex lock  wrote:

>
>
> that's something I want to change, as I said at the beginning, including
> avoiding some memory releases (like the EState and so on) and the snapshot release.
>
>

I checked my original statement and found that "snapshot release" was missing;
that obviously is a key point.


Re: Help to review the with X cursor option.

2019-04-24 Thread alex lock
On Wed, Apr 24, 2019 at 11:30 PM Tom Lane  wrote:

> alex lock  writes:
> > The cursor means  something like  declare c cursor for select * from t;
> > The holdable cursor means declare c cursor WITH HOLD for select * from t;
>
> > A holdable cursor is good across transactions: the user can still access
> > it after the transaction is committed.  But it is bad in that it has to
> > save all the records to a tuple store before we fetch a single row.
>
> > What I want is:
> > 1.  The cursor can still be fetched from after the transaction is
> > committed.
> > 2.  The cursor will not fetch the data until a fetch statement is issued
> > (just like a non-holdable cursor).
>
> > I call this a "with X" cursor.
>
> > I checked the current implementation and think it would be possible with
> > the following methods:
> > 1.  Allocate the memory in a {LongerMemoryContext}, like the EState, to
> > prevent it from being released.
> > 2.  Allocate a longer-lived resource owner to prevent the LockReleaseAll
> > during CommitTransaction.
> > 3.  Add the "with X" option to the cursor so that PreCommit_Portals will
> > not drop it during CommitTransaction.
>
> > Before I implement it,  could you give some suggestions?
>
> You don't actually understand the problem.


>
Thanks a ton.  I know that, and that's just what I want to change.


> The reason a holdable cursor forcibly reads all the data before commit is
> that the data might not be there to read any later than that.


I think this can be done with a snapshot read: we want the data as of time
1, and even if the data is no longer there at time 2, as long as we keep
the snapshot we can still read it.  Oracle has a similar feature called flashback query
https://docs.oracle.com/cd/B14117_01/appdev.101/b10795/adfns_fl.htm#1008580
 .


> Once we end
> the transaction and release its snapshot (specifically, advance the
> backend's advertised global xmin), it's possible and indeed desirable for
> obsoleted row versions to be vacuumed.


That's something I want to change, as I said at the beginning, including
avoiding some memory releases (like the EState and so on) and the snapshot release.



> The only way to avoid that would
> be to not advance xmin, which is pretty much just as bad as not committing
> the transaction.


There is a difference between "not advancing xmin" and "not committing the
transaction" for me.  "Not committing the transaction" ties up the
connection, but "not advancing xmin" does not.  Were it not for that, a
non-holdable cursor would be good enough for me.


> Not releasing the transaction's locks is also bad.


Assuming that if the table is dropped between fetches we can just raise an
error, could we release the lock?  I am still not sure about this part,
but keeping the lock is still acceptable for me since it no longer ties up
the connection (my purpose).  Releasing the lock would be better, though.


> So it doesn't seem like there's anything to be gained here that you don't
> have today by just not committing yet.
>

It is the connection :)  I want to run DML or other stuff on the same
connection.


>
> If you're concerned about not losing work due to possible errors later in
> the transaction, you could prevent those from causing problems through
> subtransactions (savepoints).
>
Thanks for your tip.  I have thought about that possibility, but I can
think about it more.  The business model is a bit complex and I don't want
to go into it here.


> regards, tom lane
>


Re: set relispartition when attaching child index

2019-04-24 Thread Amit Langote
On 2019/04/25 0:55, Amit Langote wrote:
> On Thu, Apr 25, 2019 at 12:39 AM Amit Langote  wrote:
>> On Thu, Apr 25, 2019 at 12:38 AM Amit Langote  
>> wrote:
>>> On Thu, Apr 25, 2019 at 12:35 AM Alvaro Herrera
>>>  wrote:
 On 2019-Apr-25, Amit Langote wrote:

> It seems that DefineIndex() is forgetting to update_relispartition()
> on a partition's index when it's attached to an index being added to
> the parent.  That results in unexpected behavior when adding a foreign
> key referencing the parent.

 BTW, maybe IndexSetParentIndex ought to be the one calling
 update_relispartition() ...
>>>
>>> I thought so too, but other sites are doing what I did in the patch.
>>
>> Although, we wouldn't have this bug if it was IndexSetParentIndex
>> calling it.  Maybe a good idea to do that now.
> 
> I tried that in the attached.

BTW, this will need to be back-patched to 11.

Thanks,
Amit





Re: Identity columns should own only one sequence

2019-04-24 Thread Michael Paquier
On Sun, Apr 14, 2019 at 05:51:47PM +0200, Laurenz Albe wrote:
> test=> INSERT INTO ser (id) VALUES (DEFAULT);
> ERROR:  more than one owned sequence found

Yes, this should never be user-triggerable, so it seems that we need to
fix and back-patch something if possible.

> I propose that we check if there already is a dependent sequence
> before adding an identity column.

That looks awkward.  Shouldn't we instead make sure that, when dropping the
default associated with a serial column, the dependency between the column
and the sequence is removed?  This implies more complexity in
ATExecColumnDefault().

> The attached patch does that, and also forbids setting the ownership
> of a sequence to an identity column.
> 
> I think this should be backpatched.

Could you add some test cases covering what you think is appropriate?
--
Michael


signature.asc
Description: PGP signature


Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-24 Thread Tom Lane
Peter Geoghegan  writes:
> ... I'm not
> seeing why the context that the PrepareTempTablespaces() catalog
> access occurs in actually matters.

The point there is that a catalog access might leak some amount of
memory.  Probably not enough to be a big deal, but ...
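
For readers following along, the idiom being alluded to is roughly the
following (a hypothetical sketch assuming the usual PostgreSQL
memory-context conventions, not the actual fd.c/tablespace.c code): do the
catalog work while switched into a transaction-lifetime context so that
anything it leaks is reclaimed at transaction end.

#include "postgres.h"
#include "utils/memutils.h"

static void
prepare_something_in_xact_context(void)
{
    MemoryContext oldcontext;

    oldcontext = MemoryContextSwitchTo(TopTransactionContext);

    /* ... catalog access that may palloc() transient working data ... */

    MemoryContextSwitchTo(oldcontext);
}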

regards, tom lane




Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-24 Thread Peter Geoghegan
On Wed, Apr 24, 2019 at 12:17 PM Tom Lane  wrote:
> After a bit more thought it seemed like another answer would be to
> make all three of those functions assert that the caller did the
> right thing, as per attached.  This addresses the layering-violation
> complaint, but might be more of a pain in the rear for developers.

In what sense is it not already a layering violation to call
PrepareTempTablespaces() as often as we do? PrepareTempTablespaces()
parses and validates the GUC variable and passes it to fd.c, but to me
that seems almost the same as calling the fd.c function
SetTempTablespaces() directly. PrepareTempTablespaces() allocates
memory that it won't free itself within TopTransactionContext. I'm not
seeing why the context that the PrepareTempTablespaces() catalog
access occurs in actually matters.

Like you, I find it hard to prefer one of the approaches over the
other, though I don't really know how to assess this layering
business. I'm glad that either approach will prevent oversights,
though.
-- 
Peter Geoghegan




Re: Why is it OK for dbsize.c to look at relation files directly?

2019-04-24 Thread Michael Paquier
On Wed, Apr 24, 2019 at 05:18:08PM -0700, Andres Freund wrote:
> Just so it's guaranteed that it'll never happen? :)

Items get done, from time to time...  b0eaa4c is one rare example :p

/me runs and hides.
--
Michael


signature.asc
Description: PGP signature


Re: [Patch] Check file type before calling AllocateFile() for files out of pg data directory to avoid potential issues (e.g. hang).

2019-04-24 Thread Michael Paquier
On Wed, Apr 24, 2019 at 10:36:03AM -0400, Tom Lane wrote:
> Nah, I don't buy that one either.  Nobody has any business creating any
> non-Postgres files in the stats directory ... and if somebody does want
> to stick a FIFO in there, perhaps for debug purposes, why should we stop
> them?

I have never used a FIFO in Postgres for debugging purposes, but that
sounds plausible.  I am not sure either the changes proposed in the
patch are a good idea.
--
Michael


signature.asc
Description: PGP signature


Re: Why is it OK for dbsize.c to look at relation files directly?

2019-04-24 Thread Andres Freund
On 2019-04-25 09:16:50 +0900, Michael Paquier wrote:
> Perhaps you should add a TODO item for that?

Just so it's guaranteed that it'll never happen? :)




Re: Why is it OK for dbsize.c to look at relation files directly?

2019-04-24 Thread Michael Paquier
On Wed, Apr 24, 2019 at 04:09:56PM -0700, Andres Freund wrote:
> I think we should change dbsize.c to call
> RelationGetNumberOfBlocksInFork() for relkinds != TABLE/TOAST/MATVIEW,
> and a new AM callback for those. Probably with the aforementioned
> additional logic of closing smgr references if they weren't open before
> the size computation.
> 
> Imo this pretty clearly is v13 work.

Agreed that this is out of 12's scope.  Perhaps you should add a TODO
item for that?
--
Michael


signature.asc
Description: PGP signature


Re: BUG #15672: PostgreSQL 11.1/11.2 crashed after dropping a partition table

2019-04-24 Thread Tom Lane
Amit Langote  writes:
> On 2019/04/24 7:03, Tom Lane wrote:
>> ISTM we could work around the problem with the attached, which I think
>> is a good change independently of anything else.

> Agreed.

After thinking about it more, it seems like a bad idea to put the check
in CheckIndexCompatible; that could interfere with any other use of the
function to match up potential child indexes.  (I wonder why this test
isn't the same as what DefineIndex uses to spot potential child indexes,
BTW --- it uses a completely separate function CompareIndexInfo, which
seems both wasteful and trouble waiting to happen.)

So I think we should test the relkind in TryReuseIndex, instead.
I think it would be a good idea to *also* do what you suggested
upthread and have DefineIndex clear the field when cloning an
IndexStmt to create a child; no good could possibly come of
passing that down when we intend to create a new index.

In short, I think the right short-term fix is as attached (plus
a new regression test case based on the submitted example).

Longer term, it's clearly bad that we fail to reuse child indexes
in this scenario; the point about mangled comments is minor by
comparison.  I'm inclined to think that what we want to do is
*not* recurse when creating the parent index, but to initially
make it NOT VALID, and then do ALTER ATTACH PARTITION with each
re-used child index.  This would successfully reproduce the
previous state in case only some of the partitions have attached
indexes, which I don't think works correctly right now.

BTW, I hadn't ever looked closely at what the index reuse code
does, and now that I have, my heart just sinks.  I think that
logic needs to be nuked from orbit.  RelationPreserveStorage was
never meant to be abused in this way; I invite you to contemplate
whether it's not a problem that it doesn't check the backend and
nestLevel fields of PendingRelDelete entries before killing them.
(In its originally-designed use for mapped rels at transaction end,
that wasn't a problem, but I'm afraid that it may be one now.)

The right way to do this IMO would be something closer to the
heap-swap logic in cluster.c, where we exchange the relfilenodes
of two live indexes, rather than what is happening now.  Or for
that matter, do we really need to delete the old indexes at all?

None of that looks very practical for v12, let alone back-patching
to v11, though.

regards, tom lane

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a1c91b5..002a8b8 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1125,6 +1125,18 @@ DefineIndex(Oid relationId,
 	ListCell   *lc;
 
 	/*
+	 * We can't use the same index name for the child index,
+	 * so clear idxname to let the recursive invocation choose
+	 * a new name.  Likewise, the existing target relation
+	 * field is wrong, and if indexOid or oldNode are set,
+	 * they mustn't be applied to the child either.
+	 */
+	childStmt->idxname = NULL;
+	childStmt->relation = NULL;
+	childStmt->indexOid = InvalidOid;
+	childStmt->oldNode = InvalidOid;
+
+	/*
 	 * Adjust any Vars (both in expressions and in the index's
 	 * WHERE clause) to match the partition's column numbering
 	 * in case it's different from the parent's.
@@ -1155,8 +1167,6 @@ DefineIndex(Oid relationId,
 	if (found_whole_row)
 		elog(ERROR, "cannot convert whole-row table reference");
 
-	childStmt->idxname = NULL;
-	childStmt->relation = NULL;
 	DefineIndex(childRelid, childStmt,
 InvalidOid, /* no predefined OID */
 indexRelationId,	/* this is our child */
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index aa7328e..66f863f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11454,7 +11454,9 @@ TryReuseIndex(Oid oldId, IndexStmt *stmt)
 	{
 		Relation	irel = index_open(oldId, NoLock);
 
-		stmt->oldNode = irel->rd_node.relNode;
+		/* If it's a partitioned index, there is no storage to share. */
+		if (irel->rd_rel->relkind != RELKIND_PARTITIONED_INDEX)
+			stmt->oldNode = irel->rd_node.relNode;
 		index_close(irel, NoLock);
 	}
 }
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index f8ee4b0..4b7520b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1001,6 +1001,19 @@ DefineIndex(Oid relationId,
 	ListCell   *lc;
 
 	/*
+	 * We can't use the same index name for the child index,
+	 * so clear idxname to let the recursive invocation choose
+	 * a new name.  Likewise, the existing target relation
+	 * field is wrong, and if indexOid or oldNode are set,
+	 * they mustn't be applied to the child either.
+	 */
+	childStmt->idxname = NULL;
+	childStmt->relation = NULL;
+	childStmt->relationId = childRelid;
+	childStmt->indexOid = 

Why is it OK for dbsize.c to look at relation files directly?

2019-04-24 Thread Andres Freund
Hi,

To me it seems like a noticeable layering violation for dbsize.c to
directly stat() files in the filesystem, without going through
smgr.c. And now tableam.

This means there's code knowing about file segments outside of md.c -
which imo shouldn't be the case.  We also stat a lot more than
necessary, if the relation is already open - when going through smgr.c
we'd only need to stat the last segment, rather than all previous
segments.

Always going through smgr would have the disadvantage that we'd probably
need a small bit of logic to close the relation's smgr references if it
previously wasn't open, to avoid increasing memory usage unnecessarily.


dbsize.c directly stat()ing files also means that if somebody were to
write a tableam that doesn't store data in postgres-compatible segmented
files, the size can't be displayed.


I think we should change dbsize.c to call
RelationGetNumberOfBlocksInFork() for relkinds != TABLE/TOAST/MATVIEW,
and a new AM callback for those. Probably with the aforementioned
additional logic of closing smgr references if they weren't open before
the size computation.
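
To illustrate the direction, here is a sketch of computing a relation's
size through the smgr/bufmgr layer instead of stat()ing segment files.
The function name and shape are assumptions, not a committed API; it
assumes the caller holds a suitable lock and omits the "close the smgr
reference if it wasn't already open" refinement mentioned above.

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/smgr.h"
#include "utils/rel.h"

static int64
calculate_relation_size_via_smgr(Relation rel)
{
    int64		size = 0;
    ForkNumber	forknum;

    RelationOpenSmgr(rel);

    for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
    {
        if (smgrexists(rel->rd_smgr, forknum))
            size += (int64) RelationGetNumberOfBlocksInFork(rel, forknum) * BLCKSZ;
    }

    return size;
}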

Imo this pretty clearly is v13 work.

I'd assume that pg_database_size*() would continue the same way it does
right now.

Greetings,

Andres Freund




Re: TRACE_SORT defined by default

2019-04-24 Thread Peter Geoghegan
On Wed, Apr 24, 2019 at 3:04 PM Tom Lane  wrote:
> > It is disabled by default, in the sense that the trace_sort GUC
> > defaults to off. I believe that the overhead of building in the
> > instrumentation without enabling it is indistinguishable from zero.
>
> It would probably be useful to actually prove that rather than just
> assuming it.

The number of individual trace_sort LOG messages is proportional to
the number of runs produced.

> I do see some code under the symbol that is executed
> even when !trace_sort, and in any case Andres keeps complaining that
> even always-taken branches are expensive ...

I think that you're referring to the stuff needed for the D-Trace
probes. It's a pity that there isn't better support for that, since
Linux has a lot of options around static userspace probes these days
(SystemTap is very much out of favor, and never was very popular).
There seems to be a recognition among the Linux people that the
distinction between users and backend experts is blurred. The kernel
support for things like eBPF and BCC is still patchy, but that will
change.

> Would any non-wizard really have a use for it?

That depends on what the cut-off point is for wizard. I recognize that
there is a need to draw the line somewhere. I suspect that a fair
number of people could intuit problems in a real-world scenario using
trace_sort, without having any grounding in the theory, and without
much knowledge of tuplesort.c specifically.

> It seems like we should either make this really a developer option
> (and hence not enabled by default) or else move it into some other
> category than DEVELOPER_OPTIONS.

The information that it makes available is approximately the same as
the information made available by the new
pg_stat_progress_create_index view, but with getrusage() stats.

-- 
Peter Geoghegan




Re: pg_dump is broken for partition tablespaces

2019-04-24 Thread Alvaro Herrera
On 2019-Apr-23, Alvaro Herrera wrote:

> I'm not sure yet that 100% of the patch is gone, but yes much of it
> would go away thankfully.

Of course, the part that fixes the bug that indexes move tablespace when
recreated by a rewriting ALTER TABLE is still necessary.  Included in
the attached patch.

(I think it would be good to have the relation being complained about in
the error message, though that requires passing the name to
GetDefaultTablespace.)

> I do suggest we should raise an error if rule a3 hits and it mentions
> the database tablespace (I stupidly forgot this critical point in the
> previous email).  I think astonishment is lesser that way.

As in the attached.  When pg_default is the database tablespace, these
cases fail with the patch, as expected:

alvherre=# CREATE TABLE q (a int PRIMARY KEY) PARTITION BY LIST (a) TABLESPACE 
pg_default;
psql: ERROR:  cannot specify default tablespace for partitioned relations

alvherre=# CREATE TABLE q (a int PRIMARY KEY USING INDEX TABLESPACE pg_default) 
PARTITION BY LIST (a);
psql: ERROR:  cannot specify default tablespace for partitioned relations


alvherre=# SET default_tablespace TO 'pg_default';

alvherre=# CREATE TABLE q (a int PRIMARY KEY) PARTITION BY LIST (a) ;
psql: ERROR:  cannot specify default tablespace for partitioned relations

alvherre=# CREATE TABLE q (a int PRIMARY KEY) PARTITION BY LIST (a) TABLESPACE 
foo;
psql: ERROR:  cannot specify default tablespace for partitioned relations

alvherre=# CREATE TABLE q (a int PRIMARY KEY USING INDEX TABLESPACE foo) 
PARTITION BY LIST (a);
psql: ERROR:  cannot specify default tablespace for partitioned relations


These cases work:

alvherre=# CREATE TABLE q (a int PRIMARY KEY USING INDEX TABLESPACE foo) 
PARTITION BY LIST (a) TABLESPACE foo;

alvherre=# SET default_tablespace TO '';-- the default
alvherre=# CREATE TABLE q (a int PRIMARY KEY) PARTITION BY LIST (a);

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/bootstrap/bootparse.y b/src/backend/bootstrap/bootparse.y
index 6d7e11645d2..4f2587d74af 100644
--- a/src/backend/bootstrap/bootparse.y
+++ b/src/backend/bootstrap/bootparse.y
@@ -314,6 +314,7 @@ Boot_DeclareIndexStmt:
 	stmt->transformed = false;
 	stmt->concurrent = false;
 	stmt->if_not_exists = false;
+	stmt->reset_default_tblspc = false;
 
 	/* locks and races need not concern us in bootstrap mode */
 	relationId = RangeVarGetRelid(stmt->relation, NoLock,
@@ -363,6 +364,7 @@ Boot_DeclareUniqueIndexStmt:
 	stmt->transformed = false;
 	stmt->concurrent = false;
 	stmt->if_not_exists = false;
+	stmt->reset_default_tblspc = false;
 
 	/* locks and races need not concern us in bootstrap mode */
 	relationId = RangeVarGetRelid(stmt->relation, NoLock,
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a1c91b5fb87..3599c0d8ce0 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -467,8 +467,21 @@ DefineIndex(Oid relationId,
 	LOCKTAG		heaplocktag;
 	LOCKMODE	lockmode;
 	Snapshot	snapshot;
+	int			save_nestlevel = -1;
 	int			i;
 
+	/*
+	 * Some callers need us to run with an empty default_tablespace; this is a
+	 * necessary hack to be able to reproduce catalog state accurately when
+	 * recreating indexes after table-rewriting ALTER TABLE.
+	 */
+	if (stmt->reset_default_tblspc)
+	{
+		save_nestlevel = NewGUCNestLevel();
+		(void) set_config_option("default_tablespace", "",
+ PGC_USERSET, PGC_S_SESSION,
+ GUC_ACTION_SAVE, true, 0, false);
+	}
 
 	/*
 	 * Start progress report.  If we're building a partition, this was already
@@ -622,10 +635,15 @@ DefineIndex(Oid relationId,
 	if (stmt->tableSpace)
 	{
 		tablespaceId = get_tablespace_oid(stmt->tableSpace, false);
+		if (partitioned && tablespaceId == MyDatabaseTableSpace)
+			ereport(ERROR,
+	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+	 errmsg("cannot specify default tablespace for partitioned relation")));
 	}
 	else
 	{
-		tablespaceId = GetDefaultTablespace(rel->rd_rel->relpersistence);
+		tablespaceId = GetDefaultTablespace(rel->rd_rel->relpersistence,
+			partitioned);
 		/* note InvalidOid is OK in this case */
 	}
 
@@ -980,6 +998,13 @@ DefineIndex(Oid relationId,
 
 	ObjectAddressSet(address, RelationRelationId, indexRelationId);
 
+	/*
+	 * Revert to original default_tablespace.  Must do this before any return
+	 * from this function, but after index_create, so this is a good time.
+	 */
+	if (save_nestlevel >= 0)
+		AtEOXact_GUC(true, save_nestlevel);
+
 	if (!OidIsValid(indexRelationId))
 	{
 		table_close(rel, NoLock);
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 2aac63296bf..99bf3c29f27 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -284,7 +284,7 @@ ExecRefreshMatView(RefreshMatViewStmt 

Re: TRACE_SORT defined by default

2019-04-24 Thread Tom Lane
Peter Geoghegan  writes:
> On Wed, Apr 24, 2019 at 2:15 PM Joe Conway  wrote:
>> Has anyone ever (or recently) measured the impact on performance to have
>> this enabled? Is it that generically useful for debugging of production
>> instances of Postgres that we really want it always enabled despite the
>> performance impact?

> It is disabled by default, in the sense that the trace_sort GUC
> defaults to off. I believe that the overhead of building in the
> instrumentation without enabling it is indistinguishable from zero.

It would probably be useful to actually prove that rather than just
assuming it.  I do see some code under the symbol that is executed
even when !trace_sort, and in any case Andres keeps complaining that
even always-taken branches are expensive ...

> In
> any case the current status quo is that it's built by default. I have
> used it in production, though not very often. It's easy to turn it on
> and off.

Would any non-wizard really have a use for it?

It seems like we should either make this really a developer option
(and hence not enabled by default) or else move it into some other
category than DEVELOPER_OPTIONS.

regards, tom lane




Re: TRACE_SORT defined by default

2019-04-24 Thread Peter Geoghegan
On Wed, Apr 24, 2019 at 2:29 PM Alvaro Herrera  wrote:
> This is a really strange argument.  You're saying that somebody thought
> about it: "Hmm, well, I can remove this preprocessor symbol but then
> trace_sort would no longer resemble a developer option.  So I'm going to
> leave the symbol alone".  I don't think that's what happened.  It seems
> more likely to me that nobody has gone to the trouble of deciding that
> the symbol is worth removing, let alone actually doing it.

It doesn't seem very important now.

> If the instrumentation is good, and you seem to be saying that it is, I
> think we should just remove the symbol and be done with it.

Sounds like a plan. Do you want to take care of it, Joe?

-- 
Peter Geoghegan




Re: TRACE_SORT defined by default

2019-04-24 Thread Alvaro Herrera
On 2019-Apr-24, Peter Geoghegan wrote:

> I suspect that the reason that this hasn't happened already is because
> it leaves trace_sort/TRACE_SORT in the slightly awkward position of no
> longer quite meeting the traditional definition of a "developer
> option".

This is a really strange argument.  You're saying that somebody thought
about it: "Hmm, well, I can remove this preprocessor symbol but then
trace_sort would no longer resemble a developer option.  So I'm going to
leave the symbol alone".  I don't think that's what happened.  It seems
more likely to me that nobody has gone to the trouble of deciding that
the symbol is worth removing, let alone actually doing it.

If the instrumentation is good, and you seem to be saying that it is, I
think we should just remove the symbol and be done with it.

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: TRACE_SORT defined by default

2019-04-24 Thread Peter Geoghegan
On Wed, Apr 24, 2019 at 2:15 PM Joe Conway  wrote:
> Has anyone ever (or recently) measured the impact on performance to have
> this enabled? Is it that generically useful for debugging of production
> instances of Postgres that we really want it always enabled despite the
> performance impact?

It is disabled by default, in the sense that the trace_sort GUC
defaults to off. I believe that the overhead of building in the
instrumentation without enabling it is indistinguishable from zero. In
any case the current status quo is that it's built by default. I have
used it in production, though not very often. It's easy to turn it on
and off.

> Maybe the answer to both is "yes", but if so I would agree that we ought
> to remove the define and ifdef's and just bake it in.

We're only talking about removing the option of excluding the
instrumentation from binaries when Postgres is built. I'm not aware that
anyone is doing that. If nobody was doing that, then nobody could be
affected by removing the #ifdef crud.

I suspect that the reason that this hasn't happened already is because
it leaves trace_sort/TRACE_SORT in the slightly awkward position of no
longer quite meeting the traditional definition of a "developer
option".

-- 
Peter Geoghegan




Re: TRACE_SORT defined by default

2019-04-24 Thread Joe Conway
On 4/24/19 5:10 PM, Peter Geoghegan wrote:
> On Wed, Apr 24, 2019 at 2:07 PM Joe Conway  wrote:
>> I just noticed that TRACE_SORT is defined by default (since 2005
>> apparently). It seems odd since it is the only debugging code enabled by
>> default.
> 
> I think that we should get rid of the #ifdef stuff, so that it isn't
> possible to disable the trace_sort instrumentation by commenting out
> the TRACE_SORT entry in pg_config_manual.h. I recall being opposed on
> this point by Robert Haas. Possibly because he just didn't want to
> deal with it at the time.


Has anyone ever (or recently) measured the performance impact of having
this enabled? Is it that generically useful for debugging of production
instances of Postgres that we really want it always enabled despite the
performance impact?

Maybe the answer to both is "yes", but if so I would agree that we ought
to remove the define and ifdef's and just bake it in.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: TRACE_SORT defined by default

2019-04-24 Thread Peter Geoghegan
On Wed, Apr 24, 2019 at 2:07 PM Joe Conway  wrote:
> I just noticed that TRACE_SORT is defined by default (since 2005
> apparently). It seems odd since it is the only debugging code enabled by
> default.

I think that we should get rid of the #ifdef stuff, so that it isn't
possible to disable the trace_sort instrumentation by commenting out
the TRACE_SORT entry in pg_config_manual.h. I recall being opposed on
this point by Robert Haas. Possibly because he just didn't want to
deal with it at the time.
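
For context, the pattern in question looks roughly like this throughout
tuplesort.c (paraphrased from memory, not a literal quote); removing the
#ifdef would leave only the run-time check of the trace_sort GUC:

#ifdef TRACE_SORT
    if (trace_sort)
        elog(LOG, "performsort starting: %s",
             pg_rusage_show(&state->ru_start));
#endif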

-- 
Peter Geoghegan




TRACE_SORT defined by default

2019-04-24 Thread Joe Conway
I just noticed that TRACE_SORT is defined by default (since 2005
apparently). It seems odd since it is the only debugging code enabled by
default.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development



signature.asc
Description: OpenPGP digital signature


Re: Regression test coverage of GiST index build is awful

2019-04-24 Thread Tom Lane
Alexander Korotkov  writes:
> On Wed, Apr 24, 2019 at 9:31 PM Tom Lane  wrote:
>> Why is this so bad?  It's not like the gist regression test isn't
>> ridiculously expensive already; I'd have expected it to provide
>> darn near 100% coverage for what it's costing in runtime.

> I don't think there is any idea behind this.  Seems to be just an oversight.

After poking at it a bit, the answer seems to be that the gist buffering
code isn't invoked till we get to an index size of effective_cache_size/4,
which by default would be way too much for any regression test index.

> Would you like me to write a patch improving coverage here?

Somebody needs to... that's an awful lot of code to not be testing.

regards, tom lane




Re: Zedstore - compressed in-core columnar storage

2019-04-24 Thread Ashwin Agrawal
On Tue, Apr 16, 2019 at 9:15 AM Tomas Vondra 
wrote:

>
> I'm not sure it's that clear cut, actually. Sure, it's not the usual
> (block,item) pair so it's not possible to jump to the exact location, so
> it's not the raw physical identifier as regular TID. But the data are
> organized in a btree, with the TID as a key, so it does actually provide
> some information about the location.
>

From a representation perspective it's a logical identifier.  But yes,
since it is used as the key to lay out the datums, there is a pretty good
correlation between TIDs and physical location.  You can consider it as
clustered based on TID.

> I've asked about BRIN indexes elsewhere in this thread, which I think is
> related to this question, because that index type relies on TID providing
> sufficient information about location. And I think BRIN indexes are going
> to be rather important for colstores (and formats like ORC have something
> very similar built-in).
>
> But maybe all we'll have to do is define the ranges differently - instead
> of "number of pages" we may define them as "number of rows" and it might
> be working.
>

BRIN indexes work for zedstore right now.  A block range maps to just a
range of TIDs in zedstore, as pointed out above.  When one converts a
zstid to an ItemPointer, one can get the "block number" from the
ItemPointer, like from a normal heap TID.  It doesn't mean the direct
physical location of the row in zedstore, but that's fine.

It might be sub-optimal in some cases.  For example, if one zedstore page
contains TIDs 1-1000, and another 1000-2000, and the entry in the BRIN
index covers TIDs 500-1500, we have to access both zedstore pages.  It
would be better if the cutoff points in the BRIN index matched the
physical pages of the zedstore.  But it still works, and is probably fine
in practice.



The plan is to add an integrated BRIN index in zedstore, i.e. keep min-max
values for appropriate columns within the page.  This will not help to
eliminate the I/O as an external BRIN index does, but it helps to skip
decompression and visibility checks etc. for blocks not matching the
conditions.

Just to showcase that BRIN works for zedstore, I played with the hands-on
example mentioned in [1].



With btree index on zedstore

                                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4351.50..4351.51 rows=1 width=32) (actual time=1267.140..1267.140 rows=1 loops=1)
   ->  Index Scan using idx_ztemperature_log_log_timestamp on ztemperature_log  (cost=0.56..4122.28 rows=91686 width=4) (actual time=0.117..1244.112 rows=86400 loops=1)
         Index Cond: ((log_timestamp >= '2016-04-04 00:00:00'::timestamp without time zone) AND (log_timestamp < '2016-04-05 00:00:00'::timestamp without time zone))
 Planning Time: 0.240 ms
 Execution Time: 1269.016 ms
(5 rows)



With brin index on zedstore.
Note: the bitmap scan for zedstore currently reads all the columns;
scanning only the columns required by the query is yet to be implemented.





                                                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=217538.85..217538.86 rows=1 width=32) (actual time=54.167..54.167 rows=1 loops=1)
   ->  Gather  (cost=217538.63..217538.84 rows=2 width=32) (actual time=53.967..55.184 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=216538.63..216538.64 rows=1 width=32) (actual time=42.956..42.957 rows=1 loops=3)
               ->  Parallel Bitmap Heap Scan on ztemperature_log  (cost=59.19..216446.98 rows=36660 width=4) (actual time=3.571..35.904 rows=28800 loops=3)
                     Recheck Cond: ((log_timestamp >= '2016-04-04 00:00:00'::timestamp without time zone) AND (log_timestamp < '2016-04-05 00:00:00'::timestamp without time zone))
                     Rows Removed by Index Recheck: 3968
                     Heap Blocks: lossy=381
                     ->  Bitmap Index Scan on idx_ztemperature_log_log_timestamp  (cost=0.00..37.19 rows=98270 width=0) (actual time=1.201..1.201 rows=7680 loops=1)
                           Index Cond: ((log_timestamp >= '2016-04-04 00:00:00'::timestamp without time zone) AND (log_timestamp < '2016-04-05 00:00:00'::timestamp without time zone))
 Planning Time: 0.240 ms
 Execution Time: 55.341 ms
(13 rows)

 schema_name |             index_name             | index_ratio | index_size | table_size
-------------+------------------------------------+-------------+------------+------------
 public      | idx_ztemperature_log_log_timestamp |           0 | 80 kB      | 1235 MB
(1 row)


[1] https://www.postgresql.fastware.com/blog/brin-indexes-what-are-they-and-how-do-you-use-them


Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-24 Thread Tom Lane
I wrote:
> Here's a draft patch for that.
>
> It's slightly ugly that this adds a dependency on commands/tablespace
> to fd.c, which is a pretty low-level module.  I think wanting to avoid
> that layering violation might've been the reason for doing things the
> way they are.  However, this gets rid of tablespace dependencies in
> some other files that are only marginally higher-level, like
> tuplesort.c, so I'm not sure how strong that objection is.
>
> There are three functions in fd.c that have a dependency on the
> temp tablespace info having been set up:
>   OpenTemporaryFile
>   GetTempTablespaces
>   GetNextTempTableSpace
> This patch makes the first of those automatically set up the info
> if it's not done yet.  The second one has always had an assertion
> that the caller did it already, and now the third one does too.

After a bit more thought it seemed like another answer would be to
make all three of those functions assert that the caller did the
right thing, as per attached.  This addresses the layering-violation
complaint, but might be more of a pain in the rear for developers.

Not really sure which way I like better.

regards, tom lane

diff --git a/src/backend/access/gist/gistbuildbuffers.c b/src/backend/access/gist/gistbuildbuffers.c
index 4f2363e..1d008ec 100644
--- a/src/backend/access/gist/gistbuildbuffers.c
+++ b/src/backend/access/gist/gistbuildbuffers.c
@@ -17,6 +17,7 @@
 #include "access/genam.h"
 #include "access/gist_private.h"
 #include "catalog/index.h"
+#include "commands/tablespace.h"
 #include "miscadmin.h"
 #include "storage/buffile.h"
 #include "storage/bufmgr.h"
@@ -58,6 +59,8 @@ gistInitBuildBuffers(int pagesPerBuffer, int levelStep, int maxLevel)
 	 * Create a temporary file to hold buffer pages that are swapped out of
 	 * memory.
 	 */
+	PrepareTempTablespaces();
+
 	gfbb->pfile = BufFileCreateTemp(false);
 	gfbb->nFileBlocks = 0;
 
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index fdac985..a4463ba 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1459,6 +1459,9 @@ PathNameDeleteTemporaryDir(const char *dirname)
  * outlive the transaction that created them, so this should be false -- but
  * if you need "somewhat" temporary storage, this might be useful. In either
  * case, the file is removed when the File is explicitly closed.
+ *
+ * Unless interXact is true, some caller function should have done
+ * PrepareTempTablespaces().
  */
 File
 OpenTemporaryFile(bool interXact)
@@ -1481,7 +1484,7 @@ OpenTemporaryFile(bool interXact)
 	 * force it into the database's default tablespace, so that it will not
 	 * pose a threat to possible tablespace drop attempts.
 	 */
-	if (numTempTableSpaces > 0 && !interXact)
+	if (!interXact)
 	{
 		Oid			tblspcOid = GetNextTempTableSpace();
 
@@ -2732,6 +2735,7 @@ GetTempTablespaces(Oid *tableSpaces, int numSpaces)
 Oid
 GetNextTempTableSpace(void)
 {
+	Assert(TempTablespacesAreSet());
 	if (numTempTableSpaces > 0)
 	{
 		/* Advance nextTempTableSpace counter with wraparound */


Re: Calling PrepareTempTablespaces in BufFileCreateTemp

2019-04-24 Thread Melanie Plageman
On Tue, Apr 23, 2019 at 1:06 PM Tom Lane  wrote:

> There are three functions in fd.c that have a dependency on the
> temp tablespace info having been set up:
> OpenTemporaryFile
> GetTempTablespaces
> GetNextTempTableSpace
> This patch makes the first of those automatically set up the info
> if it's not done yet.  The second one has always had an assertion
> that the caller did it already, and now the third one does too.
> An about equally plausible change would be to make all three
> call PrepareTempTablespaces, but there are so few callers of the
> second and third that I'm not sure that'd be better.  Thoughts?
>
>
I think an assertion is sufficiently clear for GetNextTempTableSpace based
on what it does and its current callers.  The same is probably true for
GetTempTablespaces.

-- 
Melanie Plageman


Re: pg_dump partitions can lead to inconsistent state after restore

2019-04-24 Thread Alvaro Herrera
So, while testing this I noticed that pg_restore fails with deadlocks if
you do a parallel restore when the --load-via-partition-root switch was
given to pg_dump.  Is that a known bug?

-- 
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Optimizer items in the release notes

2019-04-24 Thread Adam Brusselback
As a user, I am interested in the optimizer changes for sure, and I
actually had wished they were highlighted more in previous releases.

> I think planner smarts are arguably one of our weakest areas when
> compared to the big commercial databases. The more we can throw in
> there about this sort of thing the better.

Completely agree on both fronts. I have run into numerous optimizations
that I had taken for granted when I worked primarily with SQL Server but
that are not present in Postgres.
Work being done to make the Postgres optimizer smarter is great, as is
highlighting that work in the release notes IMO.


Re: Regression test coverage of GiST index build is awful

2019-04-24 Thread Alexander Korotkov
On Wed, Apr 24, 2019 at 9:31 PM Tom Lane  wrote:
> This tells a pretty scary story:
>
> https://coverage.postgresql.org/src/backend/access/gist/index.html
>
> In particular, gistbuildbuffers.c is not entered *at all*, and
> gistbuild.c is only 21% covered.
>
> I noticed this after adding an assertion that I expected
> gistInitBuildBuffers to fail on, and nonetheless getting
> through check-world just fine.
>
> Why is this so bad?  It's not like the gist regression test isn't
> ridiculously expensive already; I'd have expected it to provide
> darn near 100% coverage for what it's costing in runtime.

I don't think there is any idea behind this.  Seems to be just an oversight.

Would you like me to write a patch improving coverage here?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Regression test coverage of GiST index build is awful

2019-04-24 Thread Tom Lane
This tells a pretty scary story:

https://coverage.postgresql.org/src/backend/access/gist/index.html

In particular, gistbuildbuffers.c is not entered *at all*, and
gistbuild.c is only 21% covered.

I noticed this after adding an assertion that I expected
gistInitBuildBuffers to fail on, and nonetheless getting
through check-world just fine.

Why is this so bad?  It's not like the gist regression test isn't
ridiculously expensive already; I'd have expected it to provide
darn near 100% coverage for what it's costing in runtime.

regards, tom lane




Re: jsonpath

2019-04-24 Thread Alexander Korotkov
On Mon, Apr 22, 2019 at 1:39 AM Tom Lane  wrote:
> Alexander Korotkov  writes:
> >> On Wed, Apr 17, 2019 at 8:43 PM Tom Lane  wrote:
> >>> Yeah, I'd noticed that one too :-(.  I think the whole jsonpath patch
> >>> needs a sweep to bring its error messages into line with our style
> >>> guidelines, but no harm in starting with the obvious bugs.
>
> > I went through the jsonpath errors and made some corrections.  See the
> > attached patch.
>
> Please don't do this sort of change:
>
> -elog(ERROR, "unrecognized jsonpath item type: %d", item->type);
> +ereport(ERROR,
> +(errcode(ERRCODE_INTERNAL_ERROR),
> + errmsg("unrecognized jsonpath item type: %d", 
> item->type)));
>
> elog() is the appropriate thing for shouldn't-happen internal errors like
> these.  The only thing you've changed here, aside from making the source
> code longer, is to expose the error message for translation ... which is
> really just wasting translators' time.  Only messages we actually think
> users might need to deal with should be exposed for translation.

Makes sense.  Removed from the patch.

> @@ -623,7 +624,7 @@ executeItemOptUnwrapTarget(JsonPathExecContext *cxt, 
> JsonPathItem *jsp,
>  ereport(ERROR,
>  (errcode(ERRCODE_JSON_MEMBER_NOT_FOUND), \
>   errmsg(ERRMSG_JSON_MEMBER_NOT_FOUND),
> - errdetail("JSON object does not contain key %s",
> + errdetail("JSON object does not contain key 
> %s.",
> keybuf.data)));
>  }
>  }
>
> OK as far as it went, but you should also put double quotes around the %s.

This code actually passed the key through escape_json(), which adds double
quotes itself.  However, we don't do such a transformation in other
places.  So, the patch removes the call to escape_json() while putting
double quotes in the error message.

> (I also noticed some messages that are using single-quotes around
> interpolated strings, which is not the project standard either.)

Single-quotes are replaced with double-quotes.

> Other specific things I wanted to see fixed:
>
> * jsonpath_scan.l has some messages like "bad ..." which is not project
> style; use "invalid" or "unrecognized".  (There's probably no good
> reason not to use the same string "invalid input syntax for type jsonpath"
> that is used elsewhere.)

Fixed.

> * This in jsonpath_gram.y is quite unhelpful:
>
> yyerror(NULL, "unrecognized flag of LIKE_REGEX predicate");
>
> since it doesn't tell you what flag character it doesn't like
> (and the error positioning info isn't accurate enough to let the
> user figure that out).  It really needs to be something more like
> "unrecognized flag character \"%c\" in LIKE_REGEX predicate".
> That probably means you can't use yyerror for this, but I don't
> think yyerror was providing any useful functionality anyway :-(

Fixed.

> More generally, I'm not very much on board with this coding technique:
>
> /* Standard error message for SQL/JSON errors */
> #define ERRMSG_JSON_ARRAY_NOT_FOUND"SQL/JSON array not found"
>
> ...
>
> RETURN_ERROR(ereport(ERROR,
>  (errcode(ERRCODE_JSON_ARRAY_NOT_FOUND),
>   errmsg(ERRMSG_JSON_ARRAY_NOT_FOUND),
>   errdetail("Jsonpath wildcard array 
> accessor "
>
> In the first place, I'm not certain that this will result in the error
> message being translatable --- do the gettext tools know how to expand
> macros?
>
> In the second place, the actual strings are just restatements of their
> ERRMSG macro names, which IMO is not conformant to our message style,
> but it's too hard to see that from source code like this.  Also this
> style is pretty unworkable/unfriendly if the message needs to contain
> any %-markers, so I suspect that having a coding style like this may be
> discouraging you from providing values in places where it'd be helpful to
> do so.  What I actually see happening as a consequence of this approach is
> that you're pushing the useful information off to an errdetail, which is
> not really helpful and it's not per project style either.  The idea is to
> make the primary message as helpful as possible without being long, not
> to make it a simple restatement of the SQLSTATE that nobody can understand
> without also looking at the errdetail.
>
> In the third place, this makes it hard for people to grep for occurrences
> of an error string in our source code.
>
> And in the fourth place, we don't do this elsewhere; it does not help
> anybody for jsonpath to invent its own coding conventions that are unlike
> the rest of Postgres.
>
> So I think you should drop the ERRMSG_xxx macros, write out these error
> messages where they are used, and rethink your use of errmsg vs. errdetail.


Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation

2019-04-24 Thread Peter Geoghegan
On Wed, Apr 24, 2019 at 10:43 AM Peter Geoghegan  wrote:
> The hard part is how to do varwidth encoding for space-efficient
> partition numbers while continuing to use IndexTuple fields for heap
> TID on the leaf level, *and* also having a
> BTreeTupleGetHeapTID()-style macro to get partition number without
> walking the entire index tuple. I suppose you could make the byte at
> the end of the tuple indicate that there are in fact 31 bits total
> when its high bit is set -- otherwise it's a 7 bit integer. Something
> like that may be the way to go. The alignment rules seem to make it
> worthwhile to keep the heap TID in the tuple header; it seems
> inherently necessary to have a MAXALIGN()'d tuple header, so finding a
> way to consistently put the first MAXALIGN() quantum to good use seems
> wise.

It's even harder than that if you want to make it possible to walk the
tuple from either direction, which also seems useful. You want to be
able to jump straight to the end of the tuple to get the partition
number, while at the same time being able to access it in the usual
way, as if it was just another attribute.

Ugh, this is a mess. It would be so much easier if we had a tuple
representation that stored attribute offsets right in the header.

--
Peter Geoghegan




Re: finding changed blocks using WAL scanning

2019-04-24 Thread Robert Haas
On Wed, Apr 24, 2019 at 10:10 AM Tomas Vondra
 wrote:
> >I'm still interested in the answer to this question, but I don't see a
> >reply that specifically concerns it.  Apologies if I have missed one.
>
> I don't think prefetching WAL blocks is all that important. The WAL
> segment was probably received fairly recently (either from primary or
> archive) and so it's reasonable to assume it's still in page cache. And
> even if it's not, sequential reads are handled by readahead pretty well.
> Which is a form of prefetching.

True.  But if you are going to need to read the WAL anyway to apply
it, why shouldn't the prefetcher just read it first and use that to
drive prefetching, instead of using the modblock files?  It's strictly
less I/O, because you were going to read the WAL files anyway and now
you don't have to also read some other modblock file, and it doesn't
really seem to have any disadvantages.
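
To sketch what I mean (read_next_wal_record() and prefetch_block() are
hypothetical stand-ins here; only XLogRecGetBlockTag() and max_block_id come
from the existing xlogreader API):

    /* reader: an XLogReaderState already positioned at the replay point */
    while (read_next_wal_record(reader))            /* hypothetical wrapper */
    {
        for (int block_id = 0; block_id <= reader->max_block_id; block_id++)
        {
            RelFileNode rnode;
            ForkNumber  forknum;
            BlockNumber blkno;

            /* skip block references the record doesn't actually use */
            if (!XLogRecGetBlockTag(reader, block_id, &rnode, &forknum, &blkno))
                continue;
            /* e.g. posix_fadvise(POSIX_FADV_WILLNEED) on the data file */
            prefetch_block(&rnode, forknum, blkno);
        }
    }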

> >It's pretty clear in my mind that what I want to do here is provide
> >approximate information, not exact information.  Being able to sort
> >and deduplicate in advance seems critical to be able to make something
> >like this work on high-velocity systems.
>
> Do you have any analysis / data to support that claim? I mean, it's
> obvious that sorting and deduplicating the data right away makes
> subsequent processing more efficient, but it's not clear to me that not
> doing it would make it useless for high-velocity systems.

I did include some analysis of this point in my original post.  It
does depend on your assumptions.  If you assume that users will be OK
with memory usage that runs into the tens of gigabytes when the amount
of change since the last incremental backup is very large, then there
is probably no big problem, but that assumption sounds shaky to me.

(The customers I seem to end up on the phone with seem to be
disproportionately those running enormous systems on dramatically
underpowered hardware, which is not infrequently related to the reason
I end up on the phone with them.)

> Sure, but that's not what I proposed elsewhere in this thread. My proposal
> was to keep modblock files "raw" for WAL segments that were not recycled yet
> (so ~3 last checkpoints), and deduplicate them after that. So the vast
> majority of the 1TB of WAL will have already deduplicated data.

OK, I missed that proposal.  My biggest concern about this is that I
don't see how to square this with the proposal elsewhere on this
thread that these files should be put someplace that makes them
subject to archiving.  If the files are managed by the master in a
separate directory it can easily do this sort of thing, but if they're
archived then you can't.  Now maybe that's just a reason not to adopt
that proposal, but I don't see how to adopt both that proposal and
this one, unless we just say that we're going to spew craptons of tiny
little non-deduplicated modblock files into the archive.

> Also, maybe we can do partial deduplication, in a way that would be useful
> for prefetching. Say we only deduplicate 1MB windows - that would work at
> least for cases that touch the same page frequently (say, by inserting to
> the tail of an index, or so).

Maybe, but I'm not sure that's really optimal for any use case.

> FWIW no one cares about low-velocity systems. While raw modblock files
> would not be an issue on them, it's also mostly uninteresting from the
> prefetching perspective. It's the high-velocity sytems that have lag.

I don't think that's particularly fair.  Low-velocity systems are some
of the best candidates for incremental backup, and people who are
running such systems probably care about that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation

2019-04-24 Thread Peter Geoghegan
On Wed, Apr 24, 2019 at 5:22 AM Robert Haas  wrote:
> If you drop or detach a partition, you can either (a) perform, as part
> of that operation, a scan of every global index to remove all
> references to the former partition, or (b) tell each global indexes
> that all references to that partition number ought to be regarded as
> dead index tuples.  (b) makes detaching partitions faster and (a)
> seems hard to make rollback-safe, so I'm guessing we'll end up with
> (b).

I agree that (b) is the way to go.

> We don't want people to be able to exhaust the supply of partition
> numbers the way they can exhaust the supply of attribute numbers by
> adding and dropping columns repeatedly.

I agree that a partition numbering system needs to be able to
accommodate arbitrarily-many partitions over time. It wouldn't have
occurred to me to do it any other way. It is far far easier to make
this work than it would be to retrofit varwidth attribute numbers. We
won't have to worry about the HeapTupleHeaderGetNatts()
representation. At the same time, nothing stops us from representing
partition numbers in a simpler though less space efficient way in
system catalogs.

The main point of having global indexes is to be able to push down the
partition number and use it during index scans. We can store the
partition number at the end of the tuple on leaf pages, so that it's
easily accessible (important for VACUUM), while continuing to use the
IndexTuple fields for heap TID. On internal pages, the IndexTuple
fields must be used for the downlink (block number of child), so both
partition number and heap TID have to go towards the end of the tuples
(this happens just with heap TID on Postgres 12). Of course, suffix
truncation will manage to consistently get rid of both in most cases,
especially when the global index is a unique index.

The hard part is how to do varwidth encoding for space-efficient
partition numbers while continuing to use IndexTuple fields for heap
TID on the leaf level, *and* also having a
BTreeTupleGetHeapTID()-style macro to get partition number without
walking the entire index tuple. I suppose you could make the byte at
the end of the tuple indicate that there are in fact 31 bits total
when its high bit is set -- otherwise it's a 7 bit integer. Something
like that may be the way to go. The alignment rules seem to make it
worthwhile to keep the heap TID in the tuple header; it seems
inherently necessary to have a MAXALIGN()'d tuple header, so finding a
way to consistently put the first MAXALIGN() quantum to good use seems
wise.
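
To make the encoding idea concrete, a rough sketch of the decoding side
(purely illustrative -- nothing like this exists anywhere yet):

    /* last byte of the leaf tuple holds the low 7 bits plus a continuation bit */
    static uint32
    decode_partition_number(const unsigned char *tuple_end)
    {
        unsigned char last = tuple_end[-1];

        if ((last & 0x80) == 0)
            return last;            /* common case: 7-bit partition number */

        /* high bit set: the three preceding bytes hold the rest, 31 bits total */
        return ((uint32) (last & 0x7F) << 24) |
               ((uint32) tuple_end[-2] << 16) |
               ((uint32) tuple_end[-3] << 8) |
                (uint32) tuple_end[-4];
    }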

-- 
Peter Geoghegan




Re: Regression test PANICs with master-standby setup on same machine

2019-04-24 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-24 10:13:09 -0400, Tom Lane wrote:
>> I can't say that I like 0001 at all.  It adds a bunch of complication and
>> new failure modes (e.g., having to panic on chdir failure) in order to do
>> what exactly?  I've not been following the thread closely, but the
>> original problem is surely just a dont-do-that misconfiguration.  I also
>> suspect that this is assuming way too much about the semantics of getcwd
>> --- some filesystem configurations may have funny situations like multiple
>> paths to the same place.

> I'm not at all defending the conrete patch. But I think allowing
> relative paths to tablespaces would solve a whole lot of practical
> problems, while not meaningfully increasing failure modes.

I'm not against allowing relative tablespace paths.  But I did not like
the chdir and getcwd-semantics hazards --- why do we have to buy into
all that to allow relative paths?

I think it would likely be sufficient to state plainly in the docs
that a relative path had better point outside $PGDATA, and maybe
have some *simple* tests on the canonicalized form of the path to
prevent obvious mistakes.  Going further than that is likely to add
more problems than it removes.

regards, tom lane




Re: block-level incremental backup

2019-04-24 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Apr 24, 2019 at 9:28 AM Stephen Frost  wrote:
> > At least in part then it seems like we're viewing the level of effort
> > around what I'm talking about quite differently, and I feel like that's
> > largely because every time I mention parallel anything there's this
> > assumption that I'm asking you to parallelize pg_basebackup or write a
> > whole bunch more code to provide a fully optimized server-side parallel
> > implementation for backups.  That really wasn't what I was going for.  I
> > was thinking it would be a modest amount of additional work add
> > incremental backup via a few new commands, instead of through the
> > BASE_BACKUP protocol command, that would make parallelization possible.
> 
> I'm not sure about that.  It doesn't seem crazy difficult, but there
> are a few wrinkles.  One is that if the client is requesting files one
> at a time, it's got to have a list of all the files that it needs to
> request, and that means that it has to ask the server to make a
> preparatory pass over the whole PGDATA directory to get a list of all
> the files that exist.  That overhead is not otherwise needed.  Another
> is that the list of files might be really large, and that means that
> the client would either use a lot of memory to hold that great big
> list, or need to deal with spilling the list to a spool file
> someplace, or else have a server protocol that lets the list be
> fetched incrementally in chunks.

So, I had a thought about that when I was composing the last email and
while I'm still unsure about it, maybe it'd be useful to mention it
here- do we really need a list of every *file*, or could we reduce that
down to a list of relations + forks for the main data directory, and
then always include whatever other directories/files are appropriate?

When it comes to operating in chunks, well, if we're getting a list of
relations instead of files, we do have this thing called cursors..

> A third is that, as you mention
> further on, it means that the client has to care a lot more about
> exactly how the server is figuring out which blocks have been
> modified.  If it just says BASE_BACKUP ..., the server can be
> internally reading each block and checking the LSN, or using
> WAL-scanning or ptrack or whatever and the client doesn't need to know
> or care.  But if the client is asking for a list of modified files or
> blocks, then that presumes the information is available, and not too
> expensively, without actually reading the files.

I would think the client would be able to just ask for the list of
modified files, when it comes to building up the list of files to ask
for, which could potentially be done based on mtime instead of by WAL
scanning or by scanning the files themselves.  Don't get me wrong, I'd
prefer that we work based on the WAL, since I have more confidence in
that, but certainly quite a few of the tools do work off mtime these
days and while it's not perfect, the risk/reward there is pretty
palatable to a lot of people.

> Fourth, MAX_RATE
> probably won't actually limit to the correct rate overall if the limit
> is applied separately to each file.

Sure, I hadn't been thinking about MAX_RATE and that would certainly
complicate things if we're offering to provide MAX_RATE-type
capabilities as part of this new set of commands.

> I'd be afraid that a patch that tried to handle all that as part of
> this project would get rejected on the grounds that it was trying to
> solve too many unrelated problems.  Also, though not everybody has to
> agree on what constitutes a "modest amount of additional work," I
> would not describe solving all of those problems as a modest effort,
> but rather a pretty substantial one.

I suspect some of that's driven by how they get solved and if we decide
we have to solve all of them.  With things like MAX_RATE + incremental
backups, I wonder how that's going to end up working, when you have the
option to apply the limit to the network, or to the disk I/O.  You might
have addressed that elsewhere, I've not looked, and I'm not too
particular about it personally either, but a definition could be "max
rate at which we'll read the file you asked for on this connection" and
that would be pretty straight-forward, I'd think.

> > > Well, one thing you might want to do is have a tool that connects to
> > > the server, enters backup mode, requests information on what blocks
> > > have changed, copies those blocks via direct filesystem access, and
> > > then exits backup mode.  Such a tool would really benefit from a
> > > START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
> > > command language, because it would just skip ever issuing the
> > > SEND_FILE_CONTENTS command in favor of doing that part of the work via
> > > other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
> > > command is useless to such a tool.
> >
> > That's true, but I hardly ever hear 

Re: Regression test PANICs with master-standby setup on same machine

2019-04-24 Thread Ashwin Agrawal
On Wed, Apr 24, 2019 at 9:25 AM Andres Freund  wrote:

>  The inability
> to reasonably test master/standby setups on a single machine is pretty
> jarring (yes, one can use basebackup tablespace maps - but that doesn't
> work well for new tablespaces).


+1, agree. A feature which can't be easily tested becomes a hurdle for
development, and this is one of them. For reference, the bug reported in [1]
is hard to test and fix without an easy way to set up master/standby on the
same node. We discussed a few ways to eliminate the issue in thread [2],
but I wasn't able to find a workable solution. It would be really helpful
to lift this testing limitation.

1]
https://www.postgresql.org/message-id/flat/20190423.163949.36763221.horiguchi.kyotaro%40lab.ntt.co.jp#7fdeee86f3050df6315c04f5f6f93672
2]
https://www.postgresql.org/message-id/flat/CALfoeivGMTmCmSXRSWDf%3DujWS7L8QmoUoziv-A61f2R8DcmwiA%40mail.gmail.com#709b53c078ebe549cff2462c092a8f09


Re: Regression test PANICs with master-standby setup on same machine

2019-04-24 Thread Andres Freund
Hi,

On 2019-04-24 17:02:28 +0900, Kyotaro HORIGUCHI wrote:
> +/*
> + * Check if the path is in the data directory strictly.
> + */
> +static bool
> +is_in_data_directory(const char *path)
> +{
> + char cwd[MAXPGPATH];
> + char abspath[MAXPGPATH];
> + char absdatadir[MAXPGPATH];
> +
> + getcwd(cwd, MAXPGPATH);
> + if (chdir(path) < 0)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +  errmsg("invalid directory \"%s\": %m", path)));
> +
> + /* getcwd is defined as returning absolute path */
> + getcwd(abspath, MAXPGPATH);
> +
> + /* DataDir needs to be canonicalized */
> + if (chdir(DataDir))
> + ereport(FATAL,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +  errmsg("could not chdir to the data directory 
> \"%s\": %m",
> + DataDir)));
> + getcwd(absdatadir, MAXPGPATH);
> +
> + /* this must succeed */
> + if (chdir(cwd))
> + ereport(FATAL,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +  errmsg("could not chdir to the current working 
> directory \"%s\": %m",
> +  cwd)));
> +
> + return path_is_prefix_of_path(absdatadir, abspath);
> +}

This seems like a bad idea to me. Why don't we just use
make_absolute_path() on the proposed tablespace path, and then check
path_is_prefix_of() or such? Sure, that can be tricked using symlinks
etc, but that's already the case.
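
Roughly along these lines, I'd imagine (assuming the usual src/port helpers;
I haven't checked the exact signatures or allocation behavior):

    static bool
    tablespace_location_ok(const char *location)
    {
        char   *abs = make_absolute_path(location);
        bool    inside;

        canonicalize_path(abs);
        inside = path_is_prefix_of_path(DataDir, abs);
        free(abs);

        /* a location inside the data directory is what we want to reject */
        return !inside;
    }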

Greetings,

Andres Freund




Re: Regression test PANICs with master-standby setup on same machine

2019-04-24 Thread Andres Freund
Hi,

On 2019-04-24 10:13:09 -0400, Tom Lane wrote:
> Kyotaro HORIGUCHI  writes:
> > At Wed, 24 Apr 2019 13:23:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
> >  wrote in 
> > <20190424.132304.40676137.horiguchi.kyot...@lab.ntt.co.jp>
> >> We need to adjust relative path between PGDATA-based and
> >> pg_tblspc based. The attached first patch does that.
> 
> > This is new version. Adjusted pg_basebackup's behavior to allow
> > relative mappings. But..
> 
> I can't say that I like 0001 at all.  It adds a bunch of complication and
> new failure modes (e.g., having to panic on chdir failure) in order to do
> what exactly?  I've not been following the thread closely, but the
> original problem is surely just a dont-do-that misconfiguration.  I also
> suspect that this is assuming way too much about the semantics of getcwd
> --- some filesystem configurations may have funny situations like multiple
> paths to the same place.

I'm not at all defending the concrete patch. But I think allowing
relative paths to tablespaces would solve a whole lot of practical
problems, while not meaningfully increasing failure modes. The inability
to reasonably test master/standby setups on a single machine is pretty
jarring (yes, one can use basebackup tablespace maps - but that doesn't
work well for new tablespaces). And for a lot of production setups
absolute paths suck too - it's far from a given that primary / standby
databases need to have the same exact path layout.

Greetings,

Andres Freund




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-24 Thread Andres Freund
Hi,

On 2019-04-24 11:28:32 +0530, Amit Kapila wrote:
> On Tue, Apr 23, 2019 at 10:59 PM Andres Freund  wrote:
> > > > On 2019-04-22 18:49:44 +0530, Amit Kapila wrote:
> > > I think we should first try to see in this new scheme (a) when to set
> > > the map, (b) when to clear it, (c) how to use.  I have tried to
> > > summarize my thoughts about it, let me know what do you think about
> > > the same?
> > >
> > > When to set the map.
> > > At the beginning (the first time relation is used in the backend), the
> > > map will be clear.  When the first time in the backend, we find that
> > > FSM doesn't exist and the number of blocks is lesser than
> > > HEAP_FSM_CREATION_THRESHOLD, we set the map for the total blocks that
> > > exist at that time and mark all or alternate blocks as available.
> >
> > I think the alternate blocks scheme has to go. It's not defensible.
> >
> 
> Fair enough, I have changed it in the attached patch.  However, I
> think we should test it once the patch is ready as we have seen a
> small performance regression due to that.

Sure, but that was because you re-scanned from scratch after every
insertion, no?


> > And sure, leaving that aside we could store one byte per block
> 
> Hmm, I think you mean to say one-bit per block, right?

No, I meant byte. The normal FSM saves how much space there is for a
block, but the current local fsm doesn't. That means pages are marked as
unavailable even though other tuples would possibly fit.
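
To spell out what I have in mind, something like this (the names and the
category math are only illustrative; HEAP_FSM_CREATION_THRESHOLD is the
existing 4-page threshold discussed in this thread):

    typedef struct LocalFreeSpaceMap
    {
        BlockNumber nblocks;
        uint8       avail[HEAP_FSM_CREATION_THRESHOLD]; /* one byte per block */
    } LocalFreeSpaceMap;

    /* map bytes of free space to a 0..255 category, like the on-disk FSM does */
    static uint8
    space_to_category(Size avail)
    {
        return (uint8) (avail / (BLCKSZ / 256));
    }

    /* e.g. called from RelationGetBufferForTuple() after trying a page */
    static void
    local_map_update(LocalFreeSpaceMap *map, BlockNumber blkno, Size avail)
    {
        if (blkno < map->nblocks)
            map->avail[blkno] = space_to_category(avail);
    }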


> >  It's possible that'd come with
> > some overhead - I've not thought sufficiently about it: I assume we'd
> > still start out in each backend assuming each page is empty, and we'd
> > then rely on RelationGetBufferForTuple() to update that. What I wonder
> > is if we'd need to check if an on-disk FSM has been created every time
> > the space on a page is reduced?  I think not, if we use invalidations to
> notify others of the existence of an on-disk FSM. There's a small race,
> > but that seems ok.

> Do you mean to say that vacuum or some backend should invalidate in
> case it first time creates the FSM for a relation?

Right.


> I think it is quite possible that the vacuum takes time to trigger
> such invalidation if there are fewer deletions.  And we won't be able
> to use newly added page/s.

I'm not sure I understand what you mean by that? If the backend that
ends up creating the FSM - because it extended the relation beyond 4
pages - issues an invalidation, the time till other backends pick that
up should be minimal?


> IIUC, you are suggesting to issue invalidations when the (a) vacuum
> finds there is no FSM and page has more space now, (b) invalidation to
> notify the existence of FSM.  It seems to me that issuing more
> invalidations for the purpose of FSM can lead to an increased number
> of relation builds in the overall system.  I think this can have an
> impact when there are many small relations in the system which in some
> scenarios might not be uncommon.

If this becomes an issue I'd just create a separate type of
invalidation, one that just signals that the FSM is being invalidated.


Greetings,

Andres Freund




Re: block-level incremental backup

2019-04-24 Thread Robert Haas
On Wed, Apr 24, 2019 at 9:28 AM Stephen Frost  wrote:
> Looking at it from what I'm sitting, I brought up two ways that we
> could extend the protocol to "request from the server only those blocks
> where LSN >= threshold_value" with one being the modification to
> BASE_BACKUP and the other being a new set of commands that could be
> parallelized.  If I had assumed that you'd be thinking the same way I am
> about extending the backup protocol, I wouldn't have said anything now
> and then would have complained after you wrote a patch that just
> extended the BASE_BACKUP command, at which point I likely would have
> been told that it's now been done and that I should have mentioned it
> earlier.

Fair enough.

> At least in part then it seems like we're viewing the level of effort
> around what I'm talking about quite differently, and I feel like that's
> largely because every time I mention parallel anything there's this
> assumption that I'm asking you to parallelize pg_basebackup or write a
> whole bunch more code to provide a fully optimized server-side parallel
> implementation for backups.  That really wasn't what I was going for.  I
> was thinking it would be a modest amount of additional work add
> incremental backup via a few new commands, instead of through the
> BASE_BACKUP protocol command, that would make parallelization possible.

I'm not sure about that.  It doesn't seem crazy difficult, but there
are a few wrinkles.  One is that if the client is requesting files one
at a time, it's got to have a list of all the files that it needs to
request, and that means that it has to ask the server to make a
preparatory pass over the whole PGDATA directory to get a list of all
the files that exist.  That overhead is not otherwise needed.  Another
is that the list of files might be really large, and that means that
the client would either use a lot of memory to hold that great big
list, or need to deal with spilling the list to a spool file
someplace, or else have a server protocol that lets the list be
fetched incrementally in chunks.  A third is that, as you mention
further on, it means that the client has to care a lot more about
exactly how the server is figuring out which blocks have been
modified.  If it just says BASE_BACKUP ..., the server can be
internally reading each block and checking the LSN, or using
WAL-scanning or ptrack or whatever and the client doesn't need to know
or care.  But if the client is asking for a list of modified files or
blocks, then that presumes the information is available, and not too
expensively, without actually reading the files.  Fourth, MAX_RATE
probably won't actually limit to the correct rate overall if the limit
is applied separately to each file.

I'd be afraid that a patch that tried to handle all that as part of
this project would get rejected on the grounds that it was trying to
solve too many unrelated problems.  Also, though not everybody has to
agree on what constitutes a "modest amount of additional work," I
would not describe solving all of those problems as a modest effort,
but rather a pretty substantial one.

> There's a tangent on all of this that's pretty key though, which is the
> question around just how the blocks are identified.  If the WAL scanning
> is done to figure out the blocks, then that's quite a bit different from
> the other idea of "open this relation and scan it, but only give me the
> blocks after this LSN".  It's the latter case that I've been mostly
> thinking about in this thread, which is part of why I was thinking it'd
> be a modest amount of work to have protocol commands that accepted a
> file (or perhaps a relation..) to scan and return blocks from instead of
> baking this into BASE_BACKUP which by definition just serially scans the
> data directory and returns things as it finds them.  For the case where
> we have WAL scanning happening and modfiles which are being read and
> used to figure out the blocks to send, it seems like it might be more
> complicated and therefore potentially quite a bit more work to have a
> parallel version of that.

Yeah.  I don't entirely agree that the first one is simple, as per the
above, but I definitely agree that the second one is more complicated
than the first one.

> > Well, one thing you might want to do is have a tool that connects to
> > the server, enters backup mode, requests information on what blocks
> > have changed, copies those blocks via direct filesystem access, and
> > then exits backup mode.  Such a tool would really benefit from a
> > START_BACKUP / SEND_FILE_LIST / SEND_FILE_CONTENTS / STOP_BACKUP
> > command language, because it would just skip ever issuing the
> > SEND_FILE_CONTENTS command in favor of doing that part of the work via
> > other means.  On the other hand, a START_PARALLEL_BACKUP LSN '1/234'
> > command is useless to such a tool.
>
> That's true, but I hardly ever hear people talking about how wonderful
> it is that pgBackRest uses SSH to grab the 

Re: set relispartition when attaching child index

2019-04-24 Thread Amit Langote
On Thu, Apr 25, 2019 at 12:39 AM Amit Langote  wrote:
> On Thu, Apr 25, 2019 at 12:38 AM Amit Langote  wrote:
> > On Thu, Apr 25, 2019 at 12:35 AM Alvaro Herrera
> >  wrote:
> > > On 2019-Apr-25, Amit Langote wrote:
> > >
> > > > It seems that DefineIndex() is forgetting to update_relispartition()
> > > > on a partition's index when it's attached to an index being added to
> > > > the parent.  That results in unexpected behavior when adding a foreign
> > > > key referencing the parent.
> > >
> > > BTW, maybe IndexSetParentIndex ought to be the one calling
> > > update_relispartition() ...
> >
> > I thought so too, but other sites are doing what I did in the patch.
>
> Although, we wouldn't have this bug if it was IndexSetParentIndex
> calling it.  Maybe a good idea to do that now.

I tried that in the attached.

Thanks,
Amit


IndexSetParentIndex-update_relispartition.patch
Description: Binary data


Re: set relispartition when attaching child index

2019-04-24 Thread Amit Langote
On Thu, Apr 25, 2019 at 12:38 AM Amit Langote  wrote:
> On Thu, Apr 25, 2019 at 12:35 AM Alvaro Herrera
>  wrote:
> > On 2019-Apr-25, Amit Langote wrote:
> >
> > > It seems that DefineIndex() is forgetting to update_relispartition()
> > > on a partition's index when it's attached to an index being added to
> > > the parent.  That results in unexpected behavior when adding a foreign
> > > key referencing the parent.
> >
> > BTW, maybe IndexSetParentIndex ought to be the one calling
> > update_relispartition() ...
>
> I thought so too, but other sites are doing what I did in the patch.

Although, we wouldn't have this bug if it was IndexSetParentIndex
calling it.  Maybe a good idea to do that now.

Thanks,
Amit




Re: set relispartition when attaching child index

2019-04-24 Thread Amit Langote
On Thu, Apr 25, 2019 at 12:35 AM Alvaro Herrera
 wrote:
> On 2019-Apr-25, Amit Langote wrote:
>
> > It seems that DefineIndex() is forgetting to update_relispartition()
> > on a partition's index when it's attached to an index being added to
> > the parent.  That results in unexpected behavior when adding a foreign
> > key referencing the parent.
>
> BTW, maybe IndexSetParentIndex ought to be the one calling
> update_relispartition() ...

I thought so too, but other sites are doing what I did in the patch.

Thanks,
Amit




Re: set relispartition when attaching child index

2019-04-24 Thread Alvaro Herrera
On 2019-Apr-25, Amit Langote wrote:

> It seems that DefineIndex() is forgetting to update_relispartition()
> on a partition's index when it's attached to an index being added to
> the parent.  That results in unexpected behavior when adding a foreign
> key referencing the parent.

BTW, maybe IndexSetParentIndex ought to be the one calling
update_relispartition() ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: set relispartition when attaching child index

2019-04-24 Thread Alvaro Herrera
On 2019-Apr-25, Amit Langote wrote:

> It seems that DefineIndex() is forgetting to update_relispartition()
> on a partition's index when it's attached to an index being added to
> the parent.  That results in unexpected behavior when adding a foreign
> key referencing the parent.

Ah, thanks for fixing.  I also read Depesz's post this morning and was
going to see what was going on after I pushed the pg_dump fix.

I'll get this pushed later.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




set relispartition when attaching child index

2019-04-24 Thread Amit Langote
Hi,

It seems that DefineIndex() is forgetting to update_relispartition()
on a partition's index when it's attached to an index being added to
the parent.  That results in unexpected behavior when adding a foreign
key referencing the parent.

create table foo (a int) partition by list (a);
create table foo1 partition of foo for values in (1);
alter table foo1 add primary key (a);
alter table foo add primary key (a);
select relname, relispartition from pg_class where relname = 'foo1_pkey';
  relname  | relispartition
-----------+----------------
 foo1_pkey | f
(1 row)

create table bar (a int references foo);
ERROR:  index for 24683 not found in partition foo1

Attached patch fixes that, but I haven't added any new tests.

PS: Came to know that that's the case when reading this blog on the
new foreign key feature:
https://www.depesz.com/2019/04/24/waiting-for-postgresql-12-support-foreign-keys-that-reference-partitioned-tables/

Thanks,
Amit


DefineIndex-update_relispartition.patch
Description: Binary data


Re: Help to review the with X cursor option.

2019-04-24 Thread Tom Lane
alex lock  writes:
> By cursor I mean something like:  declare c cursor for select * from t;
> By holdable cursor I mean:  declare c cursor WITH HOLD for select * from t;

> A holdable cursor is good across transactions: the user can still access it
> after the transaction is committed.  But the downside is that it has to save
> all the records to a tuple store before we can fetch even one row.

> What I want is:
> 1.   The cursor can still be fetched from after the transaction is
> committed.
> 2.   The cursor will not materialize the data until a fetch statement is
> issued (just like a non-holdable cursor).

> I call this a "with X" cursor.

> I checked the current implementation and think it would be possible with the
> following methods:
> 1.   Allocate the memory, like the EState, in a longer-lived memory context
> so that it is not freed at commit.
> 2.   Allocate a longer-lived resource owner to prevent the LockReleaseAll
> during CommitTransaction.
> 3.   Add the "with X" option to the cursor so that PreCommit_Portals will not
> drop it during CommitTransaction.

> Before I implement it,  could you give some suggestions?

You don't actually understand the problem.

The reason a holdable cursor forcibly reads all the data before commit is
that the data might not be there to read any later than that.  Once we end
the transaction and release its snapshot (specifically, advance the
backend's advertised global xmin), it's possible and indeed desirable for
obsoleted row versions to be vacuumed.  The only way to avoid that would
be to not advance xmin, which is pretty much just as bad as not committing
the transaction.  Not releasing the transaction's locks is also bad.
So it doesn't seem like there's anything to be gained here that you don't
have today by just not committing yet.

If you're concerned about not losing work due to possible errors later in
the transaction, you could prevent those from causing problems through
subtransactions (savepoints).
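
That is, something like this (table and cursor names made up):

    BEGIN;
    DECLARE c CURSOR FOR SELECT * FROM t;
    FETCH 100 FROM c;
    SAVEPOINT s1;              -- protect the work done so far
    -- risky statements go here; on failure: ROLLBACK TO s1
    FETCH 100 FROM c;          -- the cursor is still usable, no tuplestore needed
    COMMIT;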

regards, tom lane




Re: Identity columns should own only one sequence

2019-04-24 Thread Laurenz Albe
On Sun, 2019-04-14 at 20:15 +0200, I wrote:
> I wrote:
> > Identity columns don't work if they own more than one sequence.
> 
> Alternatively, maybe getOwnedSequence should only consider sequences
> with an "internal" dependency on the column.  That would avoid the problem
> without forbidding anything, since normal OWNED BY dependencies are "auto".
> 
> What do you think?

Here is a patch that illustrates the second approach.

I'll add this thread to the next commitfest.
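
For illustration, one way to end up with a column owning two sequences (this
is my own construction, not necessarily the originally reported case):

    CREATE TABLE t (id bigint GENERATED ALWAYS AS IDENTITY);
    CREATE SEQUENCE extra_seq;
    ALTER SEQUENCE extra_seq OWNED BY t.id;   -- adds an "auto" dependency
    -- getOwnedSequences() now returns two sequences for t.id; with the patch,
    -- the identity code only considers the one with the "internal" dependency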

Yours,
Laurenz Albe

From 7f7bae5315b7770f1327a80eb192bb098ee9df89 Mon Sep 17 00:00:00 2001
From: Laurenz Albe 
Date: Wed, 24 Apr 2019 16:46:39 +0200
Subject: [PATCH] Change getOwnedSequence to only find sequences for identity
 columns

This makes identity columns work even if there is another sequence
owned by the column (with an auto dependency).
---
 src/backend/catalog/pg_depend.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/pg_depend.c b/src/backend/catalog/pg_depend.c
index d63bf5e56d..4d8c333243 100644
--- a/src/backend/catalog/pg_depend.c
+++ b/src/backend/catalog/pg_depend.c
@@ -684,14 +684,18 @@ getOwnedSequences(Oid relid, AttrNumber attnum)
 		Form_pg_depend deprec = (Form_pg_depend) GETSTRUCT(tup);
 
 		/*
-		 * We assume any auto or internal dependency of a sequence on a column
-		 * must be what we are looking for.  (We need the relkind test because
-		 * indexes can also have auto dependencies on columns.)
+		 * If no "attnum" was given, we are looking for sequences with an
+		 * auto or internal dependency.
+		 * If "attnum" was given, only look for sequences with an internal
+		 * dependency, because that is what we need for identity columns.
+		 * (We need the relkind test because indexes can also have auto
+		 * dependencies on columns.)
 		 */
 		if (deprec->classid == RelationRelationId &&
 			deprec->objsubid == 0 &&
 			deprec->refobjsubid != 0 &&
-			(deprec->deptype == DEPENDENCY_AUTO || deprec->deptype == DEPENDENCY_INTERNAL) &&
+			((!attnum && deprec->deptype == DEPENDENCY_AUTO) ||
+			 deprec->deptype == DEPENDENCY_INTERNAL) &&
 			get_rel_relkind(deprec->objid) == RELKIND_SEQUENCE)
 		{
 			result = lappend_oid(result, deprec->objid);
-- 
2.20.1



Re: Patch: doc for pg_logical_emit_message()

2019-04-24 Thread Fujii Masao
On Wed, Apr 24, 2019 at 11:12 AM Matsumura, Ryo
 wrote:
>
> On Tue. Apr. 23, 2019 at 02:59 AM Masao, Fujii
>  wrote:
>
> Thank you for the comment.
>
> > So I think that the patch should fix also the description for those
> > replication functions. Thought?
>
> I think so too.
> I attach a new patch.

Thanks for updating the patch!

+Use of functions for replication origin is restricted to superusers.
+Use of functions for replication slot is restricted to superusers
and replication roles.

"replication role" is a bit confusing. For example, we have
"replication role" related to session_replication_role. So
I think it's better to use something like "users having
REPLICATION privilege".

+Only pg_logical_emit_message can be used by any users.

Not any user, I think. For example, what about a user not having
EXECUTE privilege on pg_logical_emit_message function?
I don't think that this description only for pg_logical_emit_message
is necessary.
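Access to the function can anyway be controlled with ordinary privileges,
for example (the bytea variant of the function would need the same
treatment):

    REVOKE EXECUTE ON FUNCTION pg_logical_emit_message(boolean, text, text)
        FROM PUBLIC;
    GRANT EXECUTE ON FUNCTION pg_logical_emit_message(boolean, text, text)
        TO some_role;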

Regards,

-- 
Fujii Masao




Re: [Patch] Check file type before calling AllocateFile() for files out of pg data directory to avoid potential issues (e.g. hang).

2019-04-24 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Apr-24, Paul Guo wrote:
>> On Wed, Apr 24, 2019 at 12:49 PM Andres Freund  wrote:
>>> This seems like a bad idea to me. IMO we want to support using a pipe
>>> etc here. If the admin creates a fifo like this without attaching a
>>> program it seems like it's their fault.

>> Oh, I never know this application scenario before. So yes, for this, we
>> need to keep the current code logic in copy code.

> But the pgstat.c patch seems reasonable to me.

Nah, I don't buy that one either.  Nobody has any business creating any
non-Postgres files in the stats directory ... and if somebody does want
to stick a FIFO in there, perhaps for debug purposes, why should we stop
them?

The case with COPY is a bit different, since there it's reasonable to be
worried about collisions with other users' files --- but I agree with
Andres that this change would eliminate too many valid use-cases.

regards, tom lane




Re: Regression test PANICs with master-standby setup on same machine

2019-04-24 Thread Tom Lane
Kyotaro HORIGUCHI  writes:
> At Wed, 24 Apr 2019 13:23:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20190424.132304.40676137.horiguchi.kyot...@lab.ntt.co.jp>
>> We need to adjust relative path between PGDATA-based and
>> pg_tblspc based. The attached first patch does that.

> This is new version. Adjusted pg_basebackup's behavior to allow
> relative mappings. But..

I can't say that I like 0001 at all.  It adds a bunch of complication and
new failure modes (e.g., having to panic on chdir failure) in order to do
what exactly?  I've not been following the thread closely, but the
original problem is surely just a dont-do-that misconfiguration.  I also
suspect that this is assuming way too much about the semantics of getcwd
--- some filesystem configurations may have funny situations like multiple
paths to the same place.

0004 also seems like it's adding at least as many failure modes as
it removes.  Moreover, isn't it just postponing the failure a little?
Later WAL might well try to touch the directory you skipped creation of.
We can't realistically decide that all WAL-application errors are
ignorable, but that seems like the direction this would have us go in.

regards, tom lane




Re: finding changed blocks using WAL scanning

2019-04-24 Thread Tomas Vondra

On Wed, Apr 24, 2019 at 09:25:12AM -0400, Robert Haas wrote:

>On Mon, Apr 22, 2019 at 9:51 PM Robert Haas  wrote:
>
>>For this particular use case, wouldn't you want to read the WAL itself
>>and use that to issue prefetch requests?  Because if you use the
>>.modblock files, the data file blocks will end up in memory but the
>>WAL blocks won't, and you'll still be waiting for I/O.
>
>
>I'm still interested in the answer to this question, but I don't see a
>reply that specifically concerns it.  Apologies if I have missed one.



I don't think prefetching WAL blocks is all that important. The WAL
segment was probably received fairly recently (either from primary or
archive) and so it's reasonable to assume it's still in page cache. And
even if it's not, sequential reads are handled by readahead pretty well.
Which is a form of prefetching.

But even if WAL prefetching was useful in some cases, I think it's mostly
orthogonal issue - it certainly does not make prefetching of data pages
unnecessary.


>Stepping back a bit, I think that the basic issue under discussion
>here is how granular you want your .modblock files.  At one extreme,
>one can imagine an application that wants to know exactly which blocks
>were accessed at exactly which LSNs.  At the other extreme, if you want
>to run a daily incremental backup, you just want to know which blocks
>have been modified between the start of the previous backup and the
>start of the current backup - i.e. sometime in the last ~24 hours.
>These are quite different things.  When you only want approximate
>information - is there a chance that this block was changed within
>this LSN range, or not? - you can sort and deduplicate in advance;
>when you want exact information, you cannot do that.  Furthermore, if
>you want exact information, you must store an LSN for every record; if
>you want approximate information, you emit a file for each LSN range
>and consider it sufficient to know that the change happened somewhere
>within the range of LSNs encompassed by that file.



Those are the extreme design options, yes. But I think there may be a
reasonable middle ground, that would allow using the modblock files for
both use cases.


>It's pretty clear in my mind that what I want to do here is provide
>approximate information, not exact information.  Being able to sort
>and deduplicate in advance seems critical to be able to make something
>like this work on high-velocity systems.


Do you have any analysis / data to support that claim? I mean, it's
obvious that sorting and deduplicating the data right away makes
subsequent processing more efficient, but it's not clear to me that not
doing it would make it useless for high-velocity systems.


>If you are generating a
>terabyte of WAL between incremental backups, and you don't do any
>sorting or deduplication prior to the point when you actually try to
>generate the modified block map, you are going to need a whole lot of
>memory (and CPU time, though that's less critical, I think) to process
>all of that data.  If you can read modblock files which are already
>sorted and deduplicated, you can generate results incrementally and
>send them to the client incrementally and you never really need more
>than some fixed amount of memory no matter how much data you are
>processing.



Sure, but that's not what I proposed elsewhere in this thread. My proposal
was to keep modblock files "raw" for WAL segments that were not recycled yet
(so ~3 last checkpoints), and deduplicate them after that. So the vast
majority of the 1TB of WAL will have already deduplicated data.

Also, maybe we can do partial deduplication, in a way that would be useful
for prefetching. Say we only deduplicate 1MB windows - that would work at
least for cases that touch the same page frequently (say, by inserting to
the tail of an index, or so).


>While I'm convinced that this particular feature should provide
>approximate rather than exact information, the degree of approximation
>is up for debate, and maybe it's best to just make that configurable.
>Some applications might work best with small modblock files covering
>only ~16MB of WAL each, or even less, while others might prefer larger
>quanta, say 1GB or even more.  For incremental backup, I believe that
>the quanta will depend on the system velocity.  On a system that isn't
>very busy, fine-grained modblock files will make incremental backup
>more efficient.  If each modblock file covers only 16MB of data, and
>the backup manages to start someplace in the middle of that 16MB, then
>you'll only be including 16MB or less of unnecessary block references
>in the backup so you won't incur much extra work.  On the other hand,
>on a busy system, you probably do not want such a small quantum,
>because you will then end up with gazillions of modblock files and that
>will be hard to manage.  It could also have performance problems,
>because merging data from a couple of hundred files is fine, but
>merging data from a couple of hundred thousand files is going to be
>inefficient.  My 

Re: pgsql: Allow insert and update tuple routing and COPY for foreign table

2019-04-24 Thread Laurenz Albe
On Wed, 2019-04-24 at 14:25 +0100, Simon Riggs wrote:
> On Wed, 24 Apr 2019 at 12:55, Etsuro Fujita  
> wrote:
>  
> > > My point is that this should not be necessary.
> > 
> > In my opinion, I think this is necessary...
> 
> Could we decide by looking at what FDWs are known to exist?
> I hope we are trying to avoid breakage in the smallest number of FDWs.

A good idea.  I don't volunteer to go through the list, but I had a look
at Multicorn, which is a FDW framework used by many FDWs, and it seems
to rely on multicornBeginForeignModify being called before
multicornExecForeignInsert (the former sets up a MulticornModifyState
used by the latter).

https://github.com/Kozea/Multicorn/blob/master/src/multicorn.c

Multicorn obviously hasn't got the message yet that the API has
changed in an incompatible fashion, so I'd argue that every
Multicorn FDW with write support is currently broken.


As Andres has argued above, it is too late to do anything more about
it than to document this and warn FDW authors as well as we can.

Yours,
Laurenz Albe





Re: block-level incremental backup

2019-04-24 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Apr 22, 2019 at 2:26 PM Stephen Frost  wrote:
> > There was basically zero discussion about what things would look like at
> > a protocol level (I went back and skimmed over the thread before sending
> > my last email to specifically see if I was going to get this response
> > back..).  I get the idea behind the diff file, the contents of which I
> > wasn't getting into above.
> 
> Well, I wrote:
> 
> "There should be a way to tell pg_basebackup to request from the
> server only those blocks where LSN >= threshold_value."
> 
> I guess I assumed that people would interested in the details take
> that to mean "and therefore the protocol would grow an option for this
> type of request in whatever way is the most straightforward possible
> extension of the current functionality is," which is indeed how you
> eventually interpreted it when you said we could "extend BASE_BACKUP
> is by adding LSN as an optional parameter."

Looking at it from what I'm sitting, I brought up two ways that we
could extend the protocol to "request from the server only those blocks
where LSN >= threshold_value" with one being the modification to
BASE_BACKUP and the other being a new set of commands that could be
parallelized.  If I had assumed that you'd be thinking the same way I am
about extending the backup protocol, I wouldn't have said anything now
and then would have complained after you wrote a patch that just
extended the BASE_BACKUP command, at which point I likely would have
been told that it's now been done and that I should have mentioned it
earlier.

> > external tools to leverage that.  It sounds like what you're suggesting
> > now is that you're happy to implement the backend code, expose it in a
> > way that works just for pg_basebackup, and that if someone else wants to
> > add things to the protocol to make it easier for external tools to
> > leverage, great.
> 
> Yep, that's more or less it, although I am potentially willing to do
> some modest amount of that other work along the way.  I just don't
> want to prioritize it higher than getting the actual thing I want to
> build built, which I think is a pretty fair position for me to take.

At least in part then it seems like we're viewing the level of effort
around what I'm talking about quite differently, and I feel like that's
largely because every time I mention parallel anything there's this
assumption that I'm asking you to parallelize pg_basebackup or write a
whole bunch more code to provide a fully optimized server-side parallel
implementation for backups.  That really wasn't what I was going for.  I
was thinking it would be a modest amount of additional work add
incremental backup via a few new commands, instead of through the
BASE_BACKUP protocol command, that would make parallelization possible.

Now, through this discussion, you've brought up some really good points
about how the initial thoughts I had around how we could add some
relatively simple commands, as part of this work, to make it easier for
someone to later add parallel support to pg_basebackup (either full or
incremental), or for external tools to leverage, might not be the best
solution when it comes to having parallel backup in core, and therefore
wouldn't actually end up being useful towards that end.  That's
certainly a fair point and possibly enough to justify not spending even
the modest time I was thinking it'd need, but I'm not convinced.  Now,
that said, if you are convinced that's the case, and you're doing the
work, then it's certainly your prerogative to go in the direction you're
convinced of.  I don't mean any of this discussion to imply that I'd
object to a commit that extended BASE_BACKUP in the way outlined above,
but I understood the question to be "what do people think of this idea?"
and to that I'm still of the opinion that spending a modest amount of
time to provide a way to parallelize an incremental backup is worth it,
even if it isn't optimal and isn't the direct goal of this effort.

There's a tangent on all of this that's pretty key though, which is the
question around just how the blocks are identified.  If the WAL scanning
is done to figure out the blocks, then that's quite a bit different from
the other idea of "open this relation and scan it, but only give me the
blocks after this LSN".  It's the latter case that I've been mostly
thinking about in this thread, which is part of why I was thinking it'd
be a modest amount of work to have protocol commands that accepted a
file (or perhaps a relation..) to scan and return blocks from instead of
baking this into BASE_BACKUP which by definition just serially scans the
data directory and returns things as it finds them.  For the case where
we have WAL scanning happening and modfiles which are being read and
used to figure out the blocks to send, it seems like it might be more
complicated and therefore potentially quite a bit more work to have a
parallel 

Help to review the with X cursor option.

2019-04-24 Thread alex lock
By cursor I mean something like:  declare c cursor for select * from t;
By holdable cursor I mean:  declare c cursor WITH HOLD for select * from t;

A holdable cursor is good across transactions: the user can still access it
after the transaction is committed.  But the downside is that it has to save
all the records to a tuple store before we can fetch even one row.

What I want is:
1.   The cursor can still be fetched from after the transaction is
committed.
2.   The cursor will not materialize the data until a fetch statement is
issued (just like a non-holdable cursor).

I call this a "with X" cursor.

I checked the current implementation and think it would be possible with the
following methods:
1.   Allocate the memory, like the EState, in a longer-lived memory context
so that it is not freed at commit.
2.   Allocate a longer-lived resource owner to prevent the LockReleaseAll
during CommitTransaction.
3.   Add the "with X" option to the cursor so that PreCommit_Portals will not
drop it during CommitTransaction.

Before I implement it, could you give some suggestions?

Thanks!


Re: finding changed blocks using WAL scanning

2019-04-24 Thread Robert Haas
On Mon, Apr 22, 2019 at 9:51 PM Robert Haas  wrote:
> For this particular use case, wouldn't you want to read the WAL itself
> and use that to issue prefetch requests?  Because if you use the
> .modblock files, the data file blocks will end up in memory but the
> WAL blocks won't, and you'll still be waiting for I/O.

I'm still interested in the answer to this question, but I don't see a
reply that specifically concerns it.  Apologies if I have missed one.

Stepping back a bit, I think that the basic issue under discussion
here is how granular you want your .modblock files.  At one extreme,
one can imagine an application that wants to know exactly which blocks
were accessed at exactly which LSNs.  At the other extreme, if you want
to run a daily incremental backup, you just want to know which blocks
have been modified between the start of the previous backup and the
start of the current backup - i.e. sometime in the last ~24 hours.
These are quite different things.  When you only want approximate
information - is there a chance that this block was changed within
this LSN range, or not? - you can sort and deduplicate in advance;
when you want exact information, you cannot do that.  Furthermore, if
you want exact information, you must store an LSN for every record; if
you want approximate information, you emit a file for each LSN range
and consider it sufficient to know that the change happened somewhere
within the range of LSNs encompassed by that file.

It's pretty clear in my mind that what I want to do here is provide
approximate information, not exact information.  Being able to sort
and deduplicate in advance seems critical to be able to make something
like this work on high-velocity systems.  If you are generating a
terabyte of WAL between incremental backups, and you don't do any
sorting or deduplication prior to the point when you actually try to
generate the modified block map, you are going to need a whole lot of
memory (and CPU time, though that's less critical, I think) to process
all of that data.  If you can read modblock files which are already
sorted and deduplicated, you can generate results incrementally and
send them to the client incrementally and you never really need more
than some fixed amount of memory no matter how much data you are
processing.
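
As a rough sketch of that pre-processing step (BlockRef and the helper are
illustrative only, not actual patch code):

    typedef struct BlockRef
    {
        RelFileNode rnode;
        ForkNumber  forknum;
        BlockNumber blkno;
    } BlockRef;

    static int
    blockref_cmp(const void *a, const void *b)
    {
        /* fine as long as BlockRef has no padding bytes */
        return memcmp(a, b, sizeof(BlockRef));
    }

    /* sort the references and squeeze out duplicates; returns the new count */
    static size_t
    sort_and_dedup(BlockRef *refs, size_t n)
    {
        size_t  keep = 0;

        if (n == 0)
            return 0;
        qsort(refs, n, sizeof(BlockRef), blockref_cmp);
        for (size_t i = 1; i < n; i++)
        {
            if (memcmp(&refs[i], &refs[keep], sizeof(BlockRef)) != 0)
                refs[++keep] = refs[i];
        }
        return keep + 1;
    }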

While I'm convinced that this particular feature should provide
approximate rather than exact information, the degree of approximation
is up for debate, and maybe it's best to just make that configurable.
Some applications might work best with small modblock files covering
only ~16MB of WAL each, or even less, while others might prefer larger
quanta, say 1GB or even more.  For incremental backup, I believe that
the quanta will depend on the system velocity.  On a system that isn't
very busy, fine-grained modblock files will make incremental backup
more efficient.  If each modblock file covers only 16MB of data, and
the backup manages to start someplace in the middle of that 16MB, then
you'll only be including 16MB or less of unnecessary block references
in the backup so you won't incur much extra work.  On the other hand,
on a busy system, you probably do not want such a small quantum,
because you will then end up with gazillions of modblock files and that
will be hard to manage.  It could also have performance problems,
because merging data from a couple of hundred files is fine, but
merging data from a couple of hundred thousand files is going to be
inefficient.  My experience hacking on and testing tuplesort.c a few
years ago (with valuable tutelage by Peter Geoghegan) showed me that
there is a slow drop-off in efficiency as the merge order increases --
and in this case, at some point you will blow out the size of the OS
file descriptor table and have to start opening and closing files
every time you access a different one, and that will be unpleasant.
Finally, deduplication will tend to be more effective across larger
numbers of block references, at least on some access patterns.

So all of that is to say that if somebody wants modblock files each of
which covers 1MB of WAL, I think that the same tools I'm proposing to
build here for incremental backup could support that use case with
just a configuration change.  Moreover, the resulting files would
still be usable by the incremental backup engine.  So that's good: the
same system can, at least to some extent, be reused for whatever other
purposes people want to know about modified blocks.  On the other
hand, the incremental backup engine will likely not cope smoothly with
having hundreds of thousands or millions of modblock files shoved down
its gullet, so if there is a dramatic difference in the granularity
requirements of different consumers, another approach is likely
indicated.  Especially if some consumer wants to see block references
in the exact order in which they appear in WAL, or wants to know the
exact LSN of each reference, it's probably best to go for a different
approach.  For 

Re: pgsql: Allow insert and update tuple routing and COPY for foreign table

2019-04-24 Thread Simon Riggs
On Wed, 24 Apr 2019 at 12:55, Etsuro Fujita 
wrote:


> > My point is that this should not be necessary.
>
> In my opinion, I think this is necessary...
>

Could we decide by looking at what FDWs are known to exist?  I hope we are
trying to keep the breakage to the smallest number of FDWs.

-- 
Simon Riggshttp://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: pgsql: Allow insert and update tuple routing and COPY for foreign table

2019-04-24 Thread Laurenz Albe
On Wed, 2019-04-24 at 20:54 +0900, Etsuro Fujita wrote:
> How about adding to the documentation for BeginForeignInsert a mention 
> that if an FDW doesn't support COPY FROM and/or routable foreign tables, 
> it must throw an error in BeginForeignInsert accordingly.

Sure, some more documentation would be good.

The documentation of ExecForeignInsert should mention something like:

  ExecForeignInsert is called for INSERT statements as well
  as for COPY FROM and tuples that are inserted into a foreign table
  because it is a partition of a partitioned table.

  In the case of a normal INSERT, BeginForeignModify will be called
  before ExecForeignInsert to perform any necessary setup.
  In the other cases, this setup has to happen in BeginForeignInsert.

  Before PostgreSQL v11, a foreign data wrapper could be certain that
  BeginForeignModify is always called before ExecForeignInsert.
  This is no longer true.

> > On the other hand, if a FDW wants to support COPY in v11 and has no
> > need for BeginForeignInsert to support that, it is a simple exercise
> > for it to provide an empty BeginForeignInsert just to signal that it
> > wants to support COPY.
> 
> That seems to me inconsistent with the concept of the existing APIs for 
> updating foreign tables, because for an FDW that wants to support 
> INSERT/UPDATE/DELETE and has no need for 
> PlanForeignModify/BeginForeignModify, those APIs don't require the FDW 
> to provide empty PlanForeignModify/BeginForeignModify to tell the core 
> that it wants to support INSERT/UPDATE/DELETE.

That is true, but so far there hasn't been a change to the FDW API that
caused a callback to be invoked in a different fashion than it used to be.

Perhaps it would have been better to create a new callback like
ExecForeignCopy with the same signature as ExecForeignInsert so that
you can use the same callback function for both if you want.
That would also have avoided the breakage.
But, of course it is too late for that now.
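
For illustration only -- ExecForeignCopy does not exist, and every name below
is made up -- an FDW opting into COPY support under such an API might have
looked roughly like this:

#include "postgres.h"
#include "fmgr.h"
#include "foreign/fdwapi.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(foo_fdw_handler);

/* Shared row-insertion routine; signature matches ExecForeignInsert. */
static TupleTableSlot *
fooExecInsert(EState *estate,
			  ResultRelInfo *rinfo,
			  TupleTableSlot *slot,
			  TupleTableSlot *planSlot)
{
	/* ... send the slot's values to the remote side ... */
	return slot;
}

Datum
foo_fdw_handler(PG_FUNCTION_ARGS)
{
	FdwRoutine *routine = makeNode(FdwRoutine);

	routine->ExecForeignInsert = fooExecInsert;

	/*
	 * Hypothetical: with a separate callback, supporting COPY FROM and
	 * tuple routing would be an explicit opt-in rather than something
	 * implied by providing ExecForeignInsert alone.
	 */
	/* routine->ExecForeignCopy = fooExecInsert; */

	PG_RETURN_POINTER(routine);
}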

Note that postgres_fdw would have been broken by that API change as well
if it hadn't been patched.

At the very least, this should have been mentioned in the list of
incompatible changes for v11.

Yours,
Laurenz Albe





Re: [Patch] Check file type before calling AllocateFile() for files out of pg data directory to avoid potential issues (e.g. hang).

2019-04-24 Thread Alvaro Herrera
On 2019-Apr-24, Paul Guo wrote:

> On Wed, Apr 24, 2019 at 12:49 PM Andres Freund  wrote:

> > This seems like a bad idea to me. IMO we want to support using a pipe
> > etc here. If the admin creates a fifo like this without attaching a
> > program it seems like it's their fault.
> 
> Oh, I never knew about this application scenario before. So yes, for this, we
> need to keep the current code logic in the copy code.

But the pgstat.c patch seems reasonable to me.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Optimizer items in the release notes

2019-04-24 Thread David Rowley
On Tue, 23 Apr 2019 at 04:54, Bruce Momjian  wrote:
>
> We had a discussion in October about adding more optimizer items to the
> release notes:
>
> 
> https://www.postgresql.org/message-id/flat/20181010220601.GA7807%40momjian.us#11d805ea0b0fcd0552dfa99251417cc1
>
> There was no agreement on a change, but if people want to propose a
> change, please post here and we can discuss it.

I'd say these sorts of changes are important. TBH, these days, I think
query planner smarts are arguably one of our weakest areas when
compared to the big commercial databases. The more we can throw in
there about this sort of thing the better.  The strange thing about
making improvements to the planner is often that the benefits of doing
so can range massively, e.g. from zero effect all the way up to perhaps
thousands or even millions of times faster.  The users that get the 1
million times speedup will likely want to know that more than some
executor speedup that gets them 5% across the board.  I believe these
are useful to keep in the release notes to catch the eye of all those
people blocked from upgrading due to us not
optimising their queries the same way as they and their applications
are accustomed to.

I see from the v11 release notes that we have "E.3.3.1.4. Optimizer", so
it seems fairly simple for someone to skip that section if they happen not to
be interested in what's been changed in that area.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Thoughts on nbtree with logical/varwidth table identifiers, v12 on-disk representation

2019-04-24 Thread Robert Haas
On Mon, Apr 22, 2019 at 1:16 PM Peter Geoghegan  wrote:
> Yes, though that should probably work by reusing what we already do
> with heap TID (use standard IndexTuple fields on the leaf level for
> heap TID), plus an additional identifier for the partition number that
> is located at the physical end of the tuple. IOW, I think that this
> might benefit from a design that is half way between what we already
> do with heap TIDs and what we would be required to do to make varwidth
> logical row identifiers in tables work -- the partition number is
> varwidth, though often only a single byte.

I think we're likely to have a problem with global indexes + DETACH
PARTITION that is similar to the problem we now have with DROP COLUMN.

If you drop or detach a partition, you can either (a) perform, as part
of that operation, a scan of every global index to remove all
references to the former partition, or (b) tell each global index
that all references to that partition number ought to be regarded as
dead index tuples.  (b) makes detaching partitions faster and (a)
seems hard to make rollback-safe, so I'm guessing we'll end up with
(b).

But that means that if someone repeatedly attaches and detaches
partitions, the partition numbers could get quite big.  And even
without that somebody could have a lot of partitions.  So while I do
not disagree that the partition number could be variable-width and
sometimes only 1 payload byte, I think we had better make sure to
design the system in such a way that it scales to at least 4 payload
bytes, because I have no faith that anything less will be sufficient
for our demanding user base.

We don't want people to be able to exhaust the supply of partition
numbers the way they can exhaust the supply of attribute numbers by
adding and dropping columns repeatedly.
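
To make the width argument concrete, here is a sketch of one possible
encoding (illustration only, not a design proposal; a real index tuple would
also have to locate these bytes from the physical end of the tuple, which
this glosses over).  A continuation-bit varint keeps the common case at one
byte while scaling smoothly to 4 payload bytes and beyond:

#include <stdint.h>
#include <stddef.h>

/*
 * Encode 'partnum' least-significant 7 bits first; the high bit of each
 * byte means "more bytes follow".  Returns the number of bytes written.
 * 0..127 take one byte; a 28-bit partition number takes four.
 */
static size_t
partnum_encode(uint32_t partnum, unsigned char *buf)
{
	size_t		n = 0;

	do
	{
		unsigned char b = partnum & 0x7F;

		partnum >>= 7;
		if (partnum != 0)
			b |= 0x80;
		buf[n++] = b;
	} while (partnum != 0);

	return n;
}

/* Decode a partition number, reporting how many bytes were consumed. */
static uint32_t
partnum_decode(const unsigned char *buf, size_t *consumed)
{
	uint32_t	val = 0;
	int			shift = 0;
	size_t		n = 0;

	for (;;)
	{
		unsigned char b = buf[n++];

		val |= (uint32_t) (b & 0x7F) << shift;
		if ((b & 0x80) == 0)
			break;
		shift += 7;
	}
	*consumed = n;
	return val;
}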

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Pluggable Storage - Andres's take

2019-04-24 Thread Robert Haas
On Tue, Apr 23, 2019 at 6:55 PM Tom Lane  wrote:
> Andres Freund  writes:
> > ... I think none of these are critical issues for tableam, but we should fix
> > them.
>
> > I'm not sure about doing so for v12 though. 1) and 3) are fairly
> > trivial, but 2) would involve changing the FDW interface, by changing
> > the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
> > we're not even in beta1.
>
> Probably better to fix those API issues now rather than later.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pgsql: Allow insert and update tuple routing and COPY for foreign table

2019-04-24 Thread Etsuro Fujita

(2019/04/23 4:37), Laurenz Albe wrote:

On Mon, 2019-04-22 at 21:45 +0900, Etsuro Fujita wrote:

(2019/04/20 20:53), Laurenz Albe wrote:

On Fri, 2018-04-06 at 23:24 +, Robert Haas wrote:

Allow insert and update tuple routing and COPY for foreign tables.

Also enable this for postgres_fdw.

Etsuro Fujita, based on an earlier patch by Amit Langote. The larger
patch series of which this is a part has been reviewed by Amit
Langote, David Fetter, Maksim Milyutin, Álvaro Herrera, Stephen Frost,
and me.  Minor documentation changes to the final version by me.

Discussion: 
http://postgr.es/m/29906a26-da12-8c86-4fb9-d8f88442f...@lab.ntt.co.jp



If a FDW implements ExecForeignInsert, this commit automatically assumes
that it also supports COPY FROM.  It will call ExecForeignInsert without
calling PlanForeignModify and BeginForeignModify, and a FDW that does not
expect that will probably fail.


This is not 100% correct; the FDW documentation says:

  
   Tuples inserted into a partitioned table by
INSERT  or
   COPY FROM  are routed to partitions.  If an FDW
   supports routable foreign-table partitions, it should also provide the
   following callback functions.  These functions are also called when
   COPY FROM  is executed on a foreign table.
  


I don't see the difference between the documentation and what I wrote above.

Before v11, a FDW could expect that ExecForeignInsert is only called if
BeginForeignModify was called earlier.
That has silently changed with v11.


I have to admit that the documentation is not sufficient.


It's permissible to throw an error in BeginForeignInsert, so what I was
thinking for FDWs that don't want to support COPY FROM and
INSERT/UPDATE/COPY FROM tuple routing was to provide BeginForeignInsert
implementing something like this:

static void
fooBeginForeignInsert(ModifyTableState *mtstate,
                      ResultRelInfo *resultRelInfo)
{
	Relation	rel = resultRelInfo->ri_RelationDesc;

	if (mtstate->ps.plan == NULL)
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("cannot copy to foreign table \"%s\"",
						RelationGetRelationName(rel))));
	else
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("cannot route tuples into foreign table \"%s\"",
						RelationGetRelationName(rel))));
}


Sure, it is not hard to modify a FDW to continue working with v11.


How about adding to the documentation for BeginForeignInsert a mention 
that if an FDW doesn't support COPY FROM and/or routable foreign tables, 
it must throw an error in BeginForeignInsert accordingly.



My point is that this should not be necessary.


In my opinion, I think this is necessary...


On the other hand, if a FDW wants to support COPY in v11 and has no
need for BeginForeignInsert to support that, it is a simple exercise
for it to provide an empty BeginForeignInsert just to signal that it
wants to support COPY.


That seems to me inconsistent with the concept of the existing APIs for 
updating foreign tables, because for an FDW that wants to support 
INSERT/UPDATE/DELETE and has no need for 
PlanForeignModify/BeginForeignModify, those APIs don't require the FDW 
to provide empty PlanForeignModify/BeginForeignModify to tell the core 
that it wants to support INSERT/UPDATE/DELETE.


Best regards,
Etsuro Fujita





Re: Optimizer items in the release notes

2019-04-24 Thread Adrien NAYRAT

On 4/22/19 6:54 PM, Bruce Momjian wrote:

We had a discussion in October about adding more optimizer items to the
release notes:


https://www.postgresql.org/message-id/flat/20181010220601.GA7807%40momjian.us#11d805ea0b0fcd0552dfa99251417cc1

There was no agreement on a change, but if people want to propose a
change, please post here and we can discuss it.



Hello,

Thanks Bruce for starting this thread.

I still think it is useful to mention changes in the optimizer for 
several reasons:


- it helps to understand why a plan can change between different major 
versions. I don't have an example, but an improvement for some users 
could be a regression for others.
- knowing optimizer improvements can motivate users to upgrade to newer 
versions

- it helps to find bugs between major versions


Regards,




Re: standby recovery fails (tablespace related) (tentative patch and discussion)

2019-04-24 Thread Kyotaro HORIGUCHI
Mmm. I posted to the wrong thread. Sorry.

At Tue, 23 Apr 2019 16:39:49 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190423.163949.36763221.horiguchi.kyot...@lab.ntt.co.jp>
> At Tue, 23 Apr 2019 13:31:58 +0800, Paul Guo  wrote in 
> 
> > Hi Kyotaro, ignoring the MakePGDirectory() failure will fix this database
> > create redo error, but I suspect some other kind of redo, which depends on
> > the files under the directory (they are not copied since the directory is
> > not created) and also cannot be covered by the invalid page mechanism,
> > could fail. Thanks.
> 
> If recovery starts from just after tablespace creation, that's
> simple. The symlink to the removed tablespace is already removed
> in that case. Hence the server innocently creates files directly under
> pg_tblspc, not in the tablespace. Finally, all files that were
> supposed to be created in the removed tablespace are removed
> later in recovery.
> 
> If recovery starts from recalling a page in a file that had been
> in the tablespace, XLogReadBufferExtended creates one (perhaps
> directly in pg_tblspc as described above) and the files are
> removed later in recovery the same way as above. This case doesn't
> cause FATAL/PANIC during recovery even in master.
> 
> XLogReadBufferExtended@xlogutils.c
> | * Create the target file if it doesn't already exist.  This lets us cope
> | * if the replay sequence contains writes to a relation that is later
> | * deleted.  (The original coding of this routine would instead suppress
> | * the writes, but that seems like it risks losing valuable data if the
> | * filesystem loses an inode during a crash.  Better to write the data
> | * until we are actually told to delete the file.)
> 
> So buffered access cannot be a problem for the reason above. The
> remaining possible issue is non-buffered access to files in
> removed tablespaces. This is what I mentioned upthread:
> 
> me> but I haven't checked whether this covers all places that need the same
> me> treatment.

RM_DBASE_ID is fixed by the patch.

XLOG/XACT/CLOG/MULTIXACT/RELMAP/STANDBY/COMMIT_TS/REPLORIGIN/LOGICALMSG:
  - are not relevant.

HEAP/HEAP2/BTREE/HASH/GIN/GIST/SEQ/SPGIST/BRIN/GENERIC:
  - These resource managers work on buffers, so they are not affected.

SMGR:
  - Both CREATE and TRUNCATE seem fine.

TBLSPC:
  - We don't nest tablespace directories. No problem.

I didn't find any other similar case.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




>From bc97e195f21af5d715d85424efc21fcbe8bb902c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 22 Apr 2019 20:59:15 +0900
Subject: [PATCH 4/5] Fix failure of standby startup caused by tablespace
 removal

When a standby restarts after a crash that followed a tablespace drop,
there's a possibility that recovery fails trying to create an object in
the already-removed tablespace directory. Allow recovery to continue by
ignoring the error if the consistency point has not been reached yet.
---
 src/backend/access/transam/xlogutils.c | 34 ++
 src/backend/commands/tablespace.c  | 12 ++--
 src/backend/storage/file/copydir.c | 12 +++-
 src/include/access/xlogutils.h |  1 +
 4 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663bae6..75cdb882cd 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -522,6 +522,40 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	return buffer;
 }
 
+/*
+ * XLogMakePGDirectory
+ *
+ * There is a possibility that WAL replay causes an error by creating a
+ * directory under a directory that was removed before the previous crash.
+ * Issuing ERROR would prevent the caller from continuing recovery.
+ *
+ * To handle that case, this function issues WARNING instead of ERROR on
+ * failure if consistency has not been reached yet.
+ */
+int
+XLogMakePGDirectory(const char *directoryName)
+{
+	int ret;
+
+	ret = MakePGDirectory(directoryName);
+
+	if (ret != 0)
+	{
+		int elevel = ERROR;
+
+		/* Don't issue ERROR for this failure before reaching consistency. */
+		if (InRecovery && !reachedConsistency)
+			elevel = WARNING;
+
+		ereport(elevel,
+(errcode_for_file_access(),
+ errmsg("could not create directory \"%s\": %m", directoryName)));
+		return ret;
+	}
+
+	return 0;
+}
+
 /*
  * Struct actually returned by XLogFakeRelcacheEntry, though the declared
  * return type is Relation.
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 66a70871e6..c9fb2af015 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -303,12 +303,6 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
 (errcode(ERRCODE_INVALID_NAME),
  errmsg("tablespace location cannot contain single quotes")));
 
-	/* Reject tablespaces in the data directory. */
-	if (is_in_data_directory(location))
-		ereport(ERROR,
-

Re: bug in update tuple routing with foreign partitions

2019-04-24 Thread Amit Langote
Fujita-san,

On 2019/04/24 18:55, Etsuro Fujita wrote:
> (2019/04/23 10:03), Amit Langote wrote:
>> Thanks for adding me as an author, but I think the latest
>> patch is mostly your work, so I'm happy to be listed as just a reviewer. :)
> 
> You found this bug, analyzed it, and wrote the first version of the
> patch.  I heavily modified the patch, but I used your test case, so I
> think you deserve to be the first author of this fix.

OK, thanks.

Regards,
Amit





Re: bug in update tuple routing with foreign partitions

2019-04-24 Thread Etsuro Fujita

(2019/04/23 10:03), Amit Langote wrote:

So, it seems OK to keep this as one patch.



I read your commit message and it seems to sufficiently explain the issues
being fixed.


Cool!


Thanks for adding me as an author, but I think the latest
patch is mostly your work, so I'm happy to be listed as just a reviewer. :)


You found this bug, analyzed it, and wrote the first version of the 
patch.  I heavily modified the patch, but I used your test case, so I 
think you deserve to be the first author of this fix.



I don't have any more comments.  Thanks for working on this.


Pushed.  Many thanks, Amit-san!

Best regards,
Etsuro Fujita





Re: [PATCH v1] Show whether tables are logged in \dt+

2019-04-24 Thread Fabien COELHO



Hello David,


I noticed that there wasn't a bulk way to see table logged-ness in psql,
so I made it part of \dt+.


Applies, compiles, works for me.

ISTM that temporary-ness is not shown either. Maybe the persistence column
should be shown as is?


Temporariness added, but not raw.


OK, it is better this way.


Tests?


Included, but they're not stable for temp tables. I'm a little stumped
as to how to either stabilize them or test some other way.


Hmmm. First there is the username which appears, so there should be a 
dedicated user for the test.


I'm unsure how to work around the temporary schema number, which is 
nondeterministic with parallel execution. I'm afraid the only viable 
approach is not to show temporary tables, too bad :-(



Doc?


What further documentation does it need?


Indeed, there is no precise doc, so nothing to update :-)/:-(


Maybe you could consider adding a case for versions prior to 9.1, something 
like:

  ... case when c.relistemp then 'temporary' else 'permanent' end as ...

--
Fabien.




Re: Regression test PANICs with master-standby setup on same machine

2019-04-24 Thread Kyotaro HORIGUCHI
Hello.

At Wed, 24 Apr 2019 13:23:04 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20190424.132304.40676137.horiguchi.kyot...@lab.ntt.co.jp>
> > > with a check that forces relative paths to be outside of PGDATA (barring
> > > symlinks). As far as I can tell convert_and_check_filename() would check
> > > just about the opposite.
> 
> We need to adjust relative paths between the PGDATA-based and
> pg_tblspc-based forms. The attached first patch does that.
> 
> - I'm not sure it is OK to use getcwd this way. Another issue
>   here is that is_in_data_directory canonicalizes DataDir
>   on-the-fly. It is needed when DataDir contains '/./' or such. I
>   think the canonicalization should be done far earlier.
> 
> The second attachment is a TAP change to support tablespaces using
> relative paths.
> 
> - This is tentative, just a sample. I'll visit the current discussion thread.
> 
> The third is a test for this issue.
> 
> - Tablespace handling gets easier.
> 
> The fourth is the fix for the issue here.
> 
> - Not all possible similar issues have been checked.

This is a new version. I adjusted pg_basebackup's behavior to allow
relative mappings. But...

This is apparently beyond the scope of a bug fix.  What should I do with it?

Should we apply only 0004 (after further checking), or something like it,
as a bug fix, and then register the rest for v13?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 16b158756a8e51279604bf9f8995b9392052b274 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Wed, 24 Apr 2019 11:50:41 +0900
Subject: [PATCH 1/5] Allow relative tablespace location paths

Currently relative paths are not allowed as tablespace directories. That
makes tests involving tablespaces bothersome, but doesn't perfectly prevent
them from pointing into the data directory. This patch allows a relative
directory, resolved against the data directory, as a tablespace directory,
and then makes the check more strict.
---
 src/backend/access/transam/xlog.c|   8 ++
 src/backend/commands/tablespace.c|  76 ---
 src/backend/replication/basebackup.c |   7 ++
 src/backend/utils/adt/misc.c |   7 ++
 src/bin/pg_basebackup/pg_basebackup.c| 180 +++
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |  17 ++-
 src/test/regress/input/tablespace.source |   3 +
 src/test/regress/output/tablespace.source|   5 +-
 8 files changed, 258 insertions(+), 45 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c00b63c751..abc2fd951f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10419,6 +10419,14 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 			}
 			linkpath[rllen] = '\0';
 
+			/*
+			 * In the relative path cases, the link target is always prefixed
+			 * by "../" to convert from data directory-based. So we just do
+			 * the reverse here.
+			 */
+			if (strncmp(s, "../", 3) == 0)
+s += 3;
+
 			/*
 			 * Add the escape character '\\' before newline in a string to
 			 * ensure that we can distinguish between the newline in the
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 3784ea4b4f..66a70871e6 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -94,6 +94,42 @@ static void create_tablespace_directories(const char *location,
 			  const Oid tablespaceoid);
 static bool destroy_tablespace_directories(Oid tablespaceoid, bool redo);
 
+/*
+ * Check whether the path is strictly inside the data directory.
+ */
+static bool
+is_in_data_directory(const char *path)
+{
+	char cwd[MAXPGPATH];
+	char abspath[MAXPGPATH];
+	char absdatadir[MAXPGPATH];
+
+	getcwd(cwd, MAXPGPATH);
+	if (chdir(path) < 0)
+		ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("invalid directory \"%s\": %m", path)));
+
+	/* getcwd is defined as returning absolute path */
+	getcwd(abspath, MAXPGPATH);
+
+	/* DataDir needs to be canonicalized */
+	if (chdir(DataDir))
+		ereport(FATAL,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not chdir to the data directory \"%s\": %m",
+		DataDir)));
+	getcwd(absdatadir, MAXPGPATH);
+
+	/* this must succeed */
+	if (chdir(cwd))
+		ereport(FATAL,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("could not chdir to the current working directory \"%s\": %m",
+	 cwd)));
+
+	return path_is_prefix_of_path(absdatadir, abspath);
+}
 
 /*
  * Each database using a table space is isolated into its own name space
@@ -267,35 +303,26 @@ CreateTableSpace(CreateTableSpaceStmt *stmt)
 (errcode(ERRCODE_INVALID_NAME),
  errmsg("tablespace location cannot contain single quotes")));
 
-	/*
-	 * Allowing relative paths seems risky
-	 *
-	 * this also helps us ensure that location is not empty or whitespace
-	 */
-	if (!is_absolute_path(location))
+	/* Reject tablespaces in the data directory. */
+	if 

Re: pg_dump partitions can lead to inconsistent state after restore

2019-04-24 Thread David Rowley
On Wed, 24 Apr 2019 at 14:53, Amit Langote
 wrote:
>
> On 2019/04/24 10:19, David Rowley wrote:
> > ERROR:  invalid input syntax for type integer: "One"
> > LINE 1: INSERT INTO public.listp1 VALUES ('One', 1);
> >
> > That settles the debate on the other thread...
>
> +1 to fixing this, although +0.5 to back-patching.
>
> The reason no one has complained so far of being bitten by this may be
> that, as each of one us has said at least once on the other thread, users
> are not very likely to create partitions with different column orders to
> begin with.  Maybe, that isn't a reason to leave it as is though.

Well, you could probably class most of the bugs that make their way
through feature freeze, alpha and beta as unlikely.  I don't think
that gives us an excuse to leave them as bugs.  If someone reported it
we'd most likely go and fix it then anyway, so I really don't see the
point in waiting until then.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Symbol referencing errors

2019-04-24 Thread Li Japin
Hi,

Finally, I found that this crash is caused by shmget_osm, which may not
support SmartOS (I am not sure). When I installed Oracle Instant Client
12.2.0.1.0, it worked.

https://github.com/laurenz/oracle_fdw/issues/313


On 4/23/19 3:09 PM, Laurenz Albe wrote:

On Tue, 2019-04-23 at 04:26 +, Li Japin wrote:


Yes, those errors do not impact PostgreSQL, but when
I use the oracle_fdw extension, I couldn't start up PostgreSQL,
and I found that dlopen throws an error which leads the postmaster
to exit, and there is no more information.


That may well be a bug in oracle_fdw, since I have no reports of
anybody running it on that operating system.

Maybe you should open an oracle_fdw issue, but I don't know how
much I can help you, since this is the first time I have heard
of SmartOS.



Re: [PATCH v1] Show whether tables are logged in \dt+

2019-04-24 Thread David Fetter
On Tue, Apr 23, 2019 at 07:03:58AM +0200, Fabien COELHO wrote:
> 
> Hello David,
> 
> > I noticed that there wasn't a bulk way to see table logged-ness in psql,
> > so I made it part of \dt+.
> 
> Applies, compiles, works for me.
> 
> ISTM that temporary-ness is not shown either. Maybe the persistence column
> should be shown as is?

Temporariness added, but not raw.

> Also I'd suggest that the column should be displayed before the
> "description" column to keep the length-varying one last?

Done.

> > What say?
> 
> Tests?

Included, but they're not stable for temp tables. I'm a little stumped
as to how to either stabilize them or test some other way.

> Doc?

What further documentation does it need?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From 9e82160c2fe2f554e3417b3fbff147bc6024e3fe Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Mon, 22 Apr 2019 17:50:48 -0700
Subject: [PATCH v2] Show table persistence in \dt+
To: hackers


\d would show this for individual tables, but there wasn't an
overarching view of all tables. Now, there is.

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 5ef567c123..02e7eb0862 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3679,6 +3679,13 @@ listTables(const char *tabtypes, const char *pattern, bool verbose, bool showSys
 
 	if (verbose)
 	{
+		/*
+		 * Show whether the table is permanent, temporary, or unlogged.
+		 */
+		if (pset.sversion >= 91000)
+			appendPQExpBuffer(&buf,
+			  ",\n  case c.relpersistence when 'p' then 'permanent' when 't' then 'temporary' when 'u' then 'unlogged' else 'unknown' end as \"%s\"",
+			  gettext_noop("Persistence"));
 		/*
 		 * As of PostgreSQL 9.0, use pg_table_size() to show a more accurate
 		 * size of a table, including FSM, VM and TOAST tables.
diff --git a/src/test/regress/expected/psql.out b/src/test/regress/expected/psql.out
index 35856bffdd..f4f468804f 100644
--- a/src/test/regress/expected/psql.out
+++ b/src/test/regress/expected/psql.out
@@ -4729,3 +4729,16 @@ drop schema testpart;
 set search_path to default;
 set role to default;
 drop role testrole_partitioning;
+create table foo(id integer);
+create temp table tfoo(id integer);
+create unlogged table ufoo(id integer);
+\dt+ *.*foo
+List of relations
+  Schema   | Name | Type  |  Owner  | Persistence |  Size   | Description 
+---+--+---+-+-+-+-
+ pg_temp_3 | tfoo | table | shackle | temporary   | 0 bytes | 
+ public| foo  | table | shackle | permanent   | 0 bytes | 
+ public| ufoo | table | shackle | unlogged| 0 bytes | 
+(3 rows)
+
+drop table foo, tfoo, ufoo;
diff --git a/src/test/regress/sql/psql.sql b/src/test/regress/sql/psql.sql
index 78f4b5d7d5..c9ad6ffd9c 100644
--- a/src/test/regress/sql/psql.sql
+++ b/src/test/regress/sql/psql.sql
@@ -1115,3 +1115,10 @@ set search_path to default;
 
 set role to default;
 drop role testrole_partitioning;
+
+set search_path = public, pg_temp;
+create table foo(id integer);
+create temp table tfoo(id integer);
+create unlogged table ufoo(id integer);
+\dt+ *.*foo
+drop table foo, tfoo, ufoo;





Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-24 Thread Amit Kapila
On Tue, Apr 23, 2019 at 10:59 PM Andres Freund  wrote:
> > > On 2019-04-22 18:49:44 +0530, Amit Kapila wrote:
> > > >  /*
> > > > @@ -1132,9 +1110,6 @@ fsm_local_set(Relation rel, BlockNumber 
> > > > cur_nblocks)
> > > >   /* Set the status of the cached target block to 'unavailable'. */
> > > >   cached_target_block = RelationGetTargetBlock(rel);
> > > >   if (cached_target_block != InvalidBlockNumber &&
> > > >   cached_target_block < cur_nblocks)
> > > > - fsm_local_map.map[cached_target_block] = 
> > > > FSM_LOCAL_NOT_AVAIL;
> > > > + rel->fsm_local_map->map[cached_target_block] = 
> > > > FSM_LOCAL_NOT_AVAIL;
> > > >  }
> > >
> > > I think there shouldn't be any need for this anymore. After this change
> > > we shouldn't invalidate the map until there's no space on it - thereby
> > > addressing the cost overhead, and greatly reducing the likelihood that
> > > the local FSM can lead to increased bloat.
>

I have removed the code that was invalidating the cached target block from
the above function.

> > If we invalidate it only when there's no space on the page, then when
> > should we set it back to available, because if we don't do that, then
> > we might miss the space due to concurrent deletes.
>
> Well, deletes don't traditionally (i.e. with an actual FSM) mark free
> space as available (for heap). I think RecordPageWithFreeSpace() should
> issue a invalidation if there's no FSM, and the block goes from full to
> empty (as there's no half-full IIUC).
>

Sure, we can do that.
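
Something along these lines, perhaps (the function, its call site, and the
threshold below are illustrative only, not the actual freespace.c API):

#include "postgres.h"
#include "utils/inval.h"
#include "utils/rel.h"

/*
 * Illustrative helper, imagined to be called from RecordPageWithFreeSpace()
 * while the relation is still below the FSM-creation threshold: if a page
 * that other backends may have marked "unavailable" in their local maps
 * regains a substantial amount of space, force them to rebuild the map via
 * a relcache invalidation.
 */
static void
fsm_local_report_freed_space(Relation rel, bool fsm_exists, Size spaceAvail)
{
	if (fsm_exists)
		return;					/* the regular FSM handles this case */

	/* "full to empty": the threshold here is only for illustration */
	if (spaceAvail >= BLCKSZ / 2)
		CacheInvalidateRelcache(rel);
}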

> > > >  /*
> > > > @@ -1168,18 +1143,18 @@ fsm_local_set(Relation rel, BlockNumber 
> > > > cur_nblocks)
> > > >   * This function is used when there is no FSM.
> > > >   */
> > > >  static BlockNumber
> > > > -fsm_local_search(void)
> > > > +fsm_local_search(Relation rel)
> > > >  {
> > > >   BlockNumber target_block;
> > > >
> > > >   /* Local map must be set by now. */
> > > > - Assert(FSM_LOCAL_MAP_EXISTS);
> > > > + Assert(FSM_LOCAL_MAP_EXISTS(rel));
> > > >
> > > > - target_block = fsm_local_map.nblocks;
> > > > + target_block = rel->fsm_local_map->nblocks;
> > > >   do
> > > >   {
> > > >   target_block--;
> > > > - if (fsm_local_map.map[target_block] == FSM_LOCAL_AVAIL)
> > > > + if (rel->fsm_local_map->map[target_block] == 
> > > > FSM_LOCAL_AVAIL)
> > > >   return target_block;
> > > >   } while (target_block > 0);
> > > >
> > > > @@ -1189,7 +1164,22 @@ fsm_local_search(void)
> > > >* first, which would otherwise lead to the same conclusion again 
> > > > and
> > > >* again.
> > > >*/
> > > > - FSMClearLocalMap();
> > > > + fsm_clear_local_map(rel);
> > >
> > > I'm not sure I like this. My inclination would be that we should be able
> > > to check the local fsm repeatedly even if there's no space in the
> > > in-memory representation - otherwise the use of the local FSM increases
> > > bloat.
> > >
> >
> > Do you mean to say that we always check all the pages (say 4)
> > irrespective of their state in the local map?
>
> I was wondering that, yes. But I think just issuing invalidations is the
> right approach instead, see above.
>

Right, issuing invalidations can help with that.

>
> > I think we should first try to see in this new scheme (a) when to set
> > the map, (b) when to clear it, (c) how to use.  I have tried to
> > summarize my thoughts about it, let me know what do you think about
> > the same?
> >
> > When to set the map.
> > At the beginning (the first time relation is used in the backend), the
> > map will be clear.  When the first time in the backend, we find that
> > FSM doesn't exist and the number of blocks is lesser than
> > HEAP_FSM_CREATION_THRESHOLD, we set the map for the total blocks that
> > exist at that time and mark all or alternate blocks as available.
>
> I think the alternate blocks scheme has to go. It's not defensible.
>

Fair enough, I have changed it in the attached patch.  However, I
think we should test it once the patch is ready as we have seen a
small performance regression due to that.

> And sure, leaving that aside we could store one byte per block

Hmm, I think you mean to say one-bit per block, right?

> - it's
> just not what the patch has done so far (or rather, it used one byte per
> block, but only utilized one bit of it).

Right, I think this is an independently useful improvement, provided
it doesn't have any additional overhead or complexity.
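
For reference, a sketch of what the one-byte-per-block variant could look
like (structure and names are illustrative; fsm_space_avail_to_cat() and
fsm_space_needed_to_cat() are the existing helpers currently static to
freespace.c, so this would have to live there or duplicate the category
conversion):

/*
 * Illustrative only: a backend-local map that stores an FSM-style
 * free-space category per block instead of a single available/unavailable
 * flag.
 */
#define LOCAL_MAP_MAX_BLOCKS	HEAP_FSM_CREATION_THRESHOLD

typedef struct LocalFSM
{
	BlockNumber nblocks;					/* blocks tracked */
	uint8		cat[LOCAL_MAP_MAX_BLOCKS];	/* FSM category per block */
} LocalFSM;

/* Record what RelationGetBufferForTuple learned about a page. */
static void
local_fsm_set_avail(LocalFSM *map, BlockNumber blk, Size freespace)
{
	if (blk < map->nblocks)
		map->cat[blk] = fsm_space_avail_to_cat(freespace);
}

/* Return the highest block whose recorded category might fit 'needed'. */
static BlockNumber
local_fsm_search(LocalFSM *map, Size needed)
{
	uint8		min_cat = fsm_space_needed_to_cat(needed);
	BlockNumber blk = map->nblocks;

	while (blk > 0)
	{
		blk--;
		if (map->cat[blk] >= min_cat)
			return blk;
	}
	return InvalidBlockNumber;
}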

>  It's possible that'd come with
> some overhead - I've not thought sufficiently about it: I assume we'd
> still start out in each backend assuming each page is empty, and we'd
> then rely on RelationGetBufferForTuple() to update that. What I wonder
> is if we'd need to check if an on-disk FSM has been created every time
> the space on a page is reduced?  I think not, if we use invalidations to
> notify others of the existance of an on-disk FSM. There's a