Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-01 Thread Amit Kapila
On Tue, Apr 30, 2019 at 11:42 AM John Naylor
 wrote:
>
> On Fri, Apr 26, 2019 at 11:52 AM Amit Kapila  wrote:
> > As discussed above, we need to issue an
> > invalidation for following points:  (a) when vacuum finds there is no
> > FSM and page has more space now, I think you can detect this in
> > RecordPageWithFreeSpace
>
> I took a brief look and we'd have to know how much space was there
> before. That doesn't seem possible without first implementing the idea
> to save free space locally in the same way the FSM does. Even if we
> have consensus on that, there's no code for it, and we're running out
> of time.
>
> > (b) invalidation to notify the existence of
> > FSM, this can be done both by vacuum and backend.
>
> I still don't claim to be anything but naive in this area, but does
> the attached get us any closer?
>

@@ -776,7 +776,10 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
  if ((rel->rd_smgr->smgr_fsm_nblocks == 0 ||
  rel->rd_smgr->smgr_fsm_nblocks == InvalidBlockNumber) &&
  !smgrexists(rel->rd_smgr, FSM_FORKNUM))
+ {
  smgrcreate(rel->rd_smgr, FSM_FORKNUM, false);
+ fsm_clear_local_map(rel);
+ }

I think this won't be correct because when we call fsm_extend via
vacuum the local map won't already exist, so it won't issue an
invalidation call.  Isn't it better to directly call
CacheInvalidateRelcache here to notify other backends that their local
maps are invalid now?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-05-01 Thread Dilip Kumar
On Tue, Apr 30, 2019 at 11:45 AM Dilip Kumar  wrote:

The attached patch provides a mechanism for masking the necessary
bits in undo pages to support consistency checking of the undo
pages.  Ideally we could merge this patch with the main interface patch,
but currently I have kept it separate mainly because a) this is
still a WIP patch and b) reviewing the changes will be easier.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


0005-undo-page-consistency-checker_WIP_v3.patch
Description: Binary data


Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-01 Thread Amit Kapila
On Thu, May 2, 2019 at 7:36 AM John Naylor  wrote:
>
> On Wed, May 1, 2019 at 11:24 PM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2019-04-18 14:10:29 -0700, Andres Freund wrote:
> > > My compromise suggestion would be to try to give John and Amit ~2 weeks
> > > to come up with a cleanup proposal, and then decide whether to 1) revert
> > > 2) apply the new patch, 3) decide to live with the warts for 12, and
> > > apply the patch in 13. As we would already have a patch, 3) seems like
> > > it'd be more tenable than without.
> >
> > I think decision time has come. My tentative impression is that we're
> > not there yet,

You are right that the patch is not in committable shape, but the patch
to move the map to relcache has been presented, and the main work left
there is to review/test it and add the invalidation calls as per the
discussion.  It is just that I don't want to do that in haste and
introduce some other problems.  So, that patch should not take too much
time and will resolve the main complaint.  Basically, I was planning to
re-post that patch once the discussion between me and Alvaro concludes,
and then you can also look into it to see whether it addresses the
main complaint.  There are a few other points for which John has
prepared a patch, and that might need some work based on your inputs.

> > and should revert the improvements in v12, and apply the
> > improved version early in v13. As a second choice, we should live with
> > the current approach, if John and Amit "promise" further effort to clean
> > this up for v13.
>
> Yes, the revised approach is not currently as mature as the one in
> HEAD. It's not ready. Not wanting to attempt Promise Driven
> Development, I'd rather revert, and only try again if there's enough
> time and interest.
>

I can certainly help with moving the patch (for cleanup) forward.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-01 Thread Amit Kapila
On Thu, May 2, 2019 at 3:42 AM Alvaro Herrera  wrote:
>
> On 2019-May-01, Amit Kapila wrote:
>
> > On Tue, Apr 30, 2019 at 7:52 PM Alvaro Herrera  
> > wrote:
>
> > > Hmm ... so, if vacuum runs and frees up any space from any of the pages,
> > > then it should send out an invalidation -- it doesn't matter what the
> > > FSM had, just that there is more free space now.  That means every other
> > > process will need to determine a fresh FSM,
> >
> > I think you intend to say the local space map because once FSM is
> > created we will send invalidation and we won't further build relcache
> > entry having local space map.
>
> Yeah, I mean the map that records free space.
>
> > > but that seems correct.  Sounds better than keeping outdated entries
> > > indicating no-space-available.
> >
> > Agreed, but as mentioned in one of the above emails, I am also a bit
> > concerned that it should not lead to many invalidation messages for small
> > relations, so maybe we should send the invalidation message only when
> > the entire page is empty.
>
> I don't think that's a concern, is it?  You typically won't be running
> multiple vacuums per second, or even multiple vacuums per minute.
>

That's right.  So let's try adding an invalidation call whenever space
is reduced.  Is there a good way to test whether the new invalidation
calls added by this patch have any significant impact?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: using index or check in ALTER TABLE SET NOT NULL

2019-05-01 Thread David Rowley
On Thu, 2 May 2019 at 13:08, Tom Lane  wrote:
>
> David Rowley  writes:
> > On Thu, 2 May 2019 at 06:18, Sergei Kornilov  wrote:
> >> PS: not think this is blocker for v12
>
> > Probably not a blocker, but now does not seem like an unreasonable
> > time to lower these INFO messages down to DEBUG.
>
> Not a blocker perhaps, but it's better if we can get new behavior to
> be more or less right the first time.

It's not really new behaviour though. The code in question is for
ATTACH PARTITION and was added in c31e9d4bafd (back in 2017).
bbb96c3704c is the commit for using constraints to speed up SET NOT
NULL. The mention of using the constraint for proof was made DEBUG1 in
the initial commit.  What we need to decide is if we want to make
ATTACH PARTITION's INFO message a DEBUG1 in pg12 or not.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-01 Thread Andres Freund
Hi,

On 2019-05-01 22:01:53 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Well, as I said before, I think hiding the to-be-rebuilt index from the
> > list of indexes is dangerous too - if somebody added an actual
> > CatalogUpdate/Insert (rather than inplace_update) anywhere along the
> > index_build() path, we'd not get an assertion failure anymore, but just
> > an index without the new entry. And given the fragility with HOT hiding
> > that a lot of the time, that seems dangerous to me.
> 
> I think that argument is pretty pointless considering that "REINDEX TABLE
> pg_class" does it this way, and that code is nearly old enough to
> vote.

IMO the reindex_relation() case isn't comparable. By my read the main
purpose there is to prevent inserting into not-yet-rebuilt indexes. The
relevant comment says:
 *   If we are processing pg_class itself, we want to make sure
 * that the updates do not try to insert index entries into indexes we
 * have not processed yet.  (When we are trying to recover from
 * corrupted indexes, that could easily cause a crash.)

Note the *not processed yet* bit.  That's *not* comparable logic to
hiding the index that *already* has been rebuilt, in the middle of
reindex_index().  Yes, the way reindex_relation() is currently coded,
the RelationSetIndexList() *also* hides the already rebuilt index, but
that's hard for reindex_relation() to avoid, because it's outside of
reindex_index().


> +	 * If we are doing one index for reindex_relation, then we will find that
> +	 * the index is already not present in the index list.  In that case we
> +	 * don't have to do anything to the index list here, which we mark by
> +	 * clearing is_pg_class.
> 	 */

> -	RelationSetNewRelfilenode(iRel, persistence);
> +	is_pg_class = (RelationGetRelid(heapRelation) == RelationRelationId);
> +	if (is_pg_class)
> +	{
> +		allIndexIds = RelationGetIndexList(heapRelation);
> +		if (list_member_oid(allIndexIds, indexId))
> +		{
> +			otherIndexIds = list_delete_oid(list_copy(allIndexIds), indexId);
> +			/* Ensure rd_indexattr is valid; see comments for RelationSetIndexList */
> +			(void) RelationGetIndexAttrBitmap(heapRelation, INDEX_ATTR_BITMAP_ALL);
> +		}
> +		else
> +			is_pg_class = false;
> +	}

That's not pretty either :(

Greetings,

Andres Freund




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-01 Thread John Naylor
On Wed, May 1, 2019 at 11:24 PM Andres Freund  wrote:
>
> Hi,
>
> On 2019-04-18 14:10:29 -0700, Andres Freund wrote:
> > My compromise suggestion would be to try to give John and Amit ~2 weeks
> > to come up with a cleanup proposal, and then decide whether to 1) revert
> > 2) apply the new patch, 3) decide to live with the warts for 12, and
> > apply the patch in 13. As we would already have a patch, 3) seems like
> > it'd be more tenable than without.
>
> I think decision time has come. My tentative impression is that we're
> not there yet, and should revert the improvements in v12, and apply the
> improved version early in v13. As a second choice, we should live with
> the current approach, if John and Amit "promise" further effort to clean
> this up for v13.

Yes, the revised approach is not currently as mature as the one in
HEAD. It's not ready. Not wanting to attempt Promise Driven
Development, I'd rather revert, and only try again if there's enough
time and interest.

-- 
John Naylor    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-01 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-01 19:41:24 -0400, Tom Lane wrote:
>> OK, so per the other thread, it seems like the error recovery problem
>> isn't going to affect this directly.  However, I still don't like this
>> proposal much; the reason being that it's a rather fundamental change
>> in the API of RelationSetNewRelfilenode.  This will certainly break
>> any external callers of that function --- and silently, too.

> Couldn't we just address that by adding a new
> RelationSetNewRelfilenodeInternal() that's then wrapped by
> RelationSetNewRelfilenode() which just does
> RelationSetNewRelfilenodeInternal();CCI();?

That's just adding more ugliness ...

>> The solution I'm thinking of should have much more localized effects,
>> basically just in reindex_index and RelationSetNewRelfilenode, which is
>> why I like it better.

> Well, as I said before, I think hiding the to-be-rebuilt index from the
> list of indexes is dangerous too - if somebody added an actual
> CatalogUpdate/Insert (rather than inplace_update) anywhere along the
> index_build() path, we'd not get an assertion failure anymore, but just
> an index without the new entry. And given the fragility with HOT hiding
> that a lot of the time, that seems dangerous to me.

I think that argument is pretty pointless considering that "REINDEX TABLE
pg_class" does it this way, and that code is nearly old enough to vote.
Perhaps there'd be value in rewriting things so that we don't need
RelationSetIndexList at all, but it's not real clear to me what we'd do
instead, and in any case I don't agree with back-patching such a change.
In the near term it seems better to me to make "REINDEX INDEX
some-pg_class-index" handle this problem the same way "REINDEX TABLE
pg_class" has been doing for many years.

Attached is a draft patch for this.  It passes check-world with
xxx_FORCE_RELEASE, and gets through reindexing pg_class with
CLOBBER_CACHE_ALWAYS, but I've not completed a full CCA run.

regards, tom lane

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index ce02410..1af959c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3261,6 +3261,9 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 heapRelation;
 	Oid			heapId;
 	IndexInfo  *indexInfo;
+	bool		is_pg_class;
+	List	   *allIndexIds = NIL;
+	List	   *otherIndexIds = NIL;
 	volatile bool skipped_constraint = false;
 	PGRUsage	ru0;
 
@@ -3331,19 +3334,52 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 	}
 
 	/*
-	 * Build a new physical relation for the index. Need to do that before
-	 * "officially" starting the reindexing with SetReindexProcessing -
-	 * otherwise the necessary pg_class changes cannot be made with
-	 * encountering assertions.
+	 * RelationSetNewRelfilenode will need to update the index's pg_class row.
+	 * If we are reindexing some index of pg_class, that creates a problem;
+	 * once we call SetReindexProcessing, the index support will complain if
+	 * we try to insert a new index entry.  But we can't do that in the other
+	 * order without creating other problems.  We solve this by temporarily
+	 * removing the target index from pg_class's index list.  It won't get
+	 * updated during RelationSetNewRelfilenode, but that's fine because the
+	 * subsequent index build will include the new entry.  (There are more
+	 * comments about this in reindex_relation.)
+	 *
+	 * If we are doing one index for reindex_relation, then we will find that
+	 * the index is already not present in the index list.  In that case we
+	 * don't have to do anything to the index list here, which we mark by
+	 * clearing is_pg_class.
 	 */
-	RelationSetNewRelfilenode(iRel, persistence);
+	is_pg_class = (RelationGetRelid(heapRelation) == RelationRelationId);
+	if (is_pg_class)
+	{
+		allIndexIds = RelationGetIndexList(heapRelation);
+		if (list_member_oid(allIndexIds, indexId))
+		{
+			otherIndexIds = list_delete_oid(list_copy(allIndexIds), indexId);
+			/* Ensure rd_indexattr is valid; see comments for RelationSetIndexList */
+			(void) RelationGetIndexAttrBitmap(heapRelation, INDEX_ATTR_BITMAP_ALL);
+		}
+		else
+			is_pg_class = false;
+	}
 
-	/* ensure SetReindexProcessing state isn't leaked */
+	/*
+	 * Ensure SetReindexProcessing state isn't leaked.  (We don't have to
+	 * clean up the RelationSetIndexList state on error, though; transaction
+	 * abort knows about fixing that.)
+	 */
 	PG_TRY();
 	{
 		/* Suppress use of the target index while rebuilding it */
 		SetReindexProcessing(heapId, indexId);
 
+		/* ... and suppress updating it too, if necessary */
+		if (is_pg_class)
+			RelationSetIndexList(heapRelation, otherIndexIds);
+
+		/* Build a new physical relation for the index */
+		RelationSetNewRelfilenode(iRel, persistence);
+
 		/* Initialize the index and rebuild */
 		/* Note: we do not need to re-establish pkey setting */
 		index_build(heapRelation, iRel, in

Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-01 Thread Andres Freund
Hi,

On 2019-05-01 19:41:24 -0400, Tom Lane wrote:
> I wrote:
> > Andres Freund  writes:
> >> On 2019-05-01 10:06:03 -0700, Andres Freund wrote:
> >>> I'm not sure this is the right short-term answer. Why isn't it, for now,
> >>> sufficient to do what I suggested with RelationSetNewRelfilenode() not
> >>> doing the CommandCounterIncrement(), and reindex_index() then doing the
> >>> SetReindexProcessing() before a CommandCounterIncrement()? That's like
> >>> ~10 line code change, and a few more with comments.
> 
> > That looks like a hack to me...
> 
> > The main thing I'm worried about right now is that I realized that
> > our recovery from errors in this area is completely hosed, cf
> > https://www.postgresql.org/message-id/4541.1556736...@sss.pgh.pa.us
> 
> OK, so per the other thread, it seems like the error recovery problem
> isn't going to affect this directly.  However, I still don't like this
> proposal much; the reason being that it's a rather fundamental change
> in the API of RelationSetNewRelfilenode.  This will certainly break
> any external callers of that function --- and silently, too.
> 
> Admittedly, there might not be any outside callers, but I don't really
> like that assumption for something we're going to have to back-patch.

Couldn't we just address that by adding a new
RelationSetNewRelfilenodeInternal() that's then wrapped by
RelationSetNewRelfilenode() which just does
RelationSetNewRelfilenodeInternal();CCI();?

Doesn't have to be ...Internal(), could also be
RelationBeginSetNewRelfilenode() or such.

I'm not sure why you think using CCI() for this purpose is a hack? To me
the ability to have catalog changes only take effect when they're all
done, and the system is ready for them, is one of the core purposes of
the infrastructure?


> The solution I'm thinking of should have much more localized effects,
> basically just in reindex_index and RelationSetNewRelfilenode, which is
> why I like it better.

Well, as I said before, I think hiding the to-be-rebuilt index from the
list of indexes is dangerous too - if somebody added an actual
CatalogUpdate/Insert (rather than inplace_update) anywhere along the
index_build() path, we'd not get an assertion failure anymore, but just
an index without the new entry. And given the fragility with HOT hiding
that a lot of the time, that seems dangerous to me.


Greetings,

Andres Freund




Re: using index or check in ALTER TABLE SET NOT NULL

2019-05-01 Thread Tom Lane
David Rowley  writes:
> On Thu, 2 May 2019 at 06:18, Sergei Kornilov  wrote:
>> PS: not think this is blocker for v12

> Probably not a blocker, but now does not seem like an unreasonable
> time to lower these INFO messages down to DEBUG.

Not a blocker perhaps, but it's better if we can get new behavior to
be more or less right the first time.

regards, tom lane




Re: PostgreSQL Asian language support for full text search using ICU (and also updating pg_trgm)

2019-05-01 Thread Tatsuo Ishii
[redirected to hackers list since I think this topic is related to
adding a new PostgreSQL feature.]

I think there's no doubt that it would be nice if PostgreSQL natively
supported Asian languages. As a first step, I briefly tested the ICU
tokenizer (ubrk_open and related functions) with Japanese, the only
Asian language I understand. The result was a little different
from that of the most popular Japanese tokenizer, "Mecab" [1], but it
seems I can live with that as far as full text search is concerned.
Of course, more tests would be needed.

In addition to the accuracy of tokenizing, performance is of course
important. This needs more work.

I think similar studies would be needed for other Asian languages. I
hope someone who is familiar with other Asian languages volunteers to
do the task.

[1] https://taku910.github.io/mecab/

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

From: Chanon Sajjamanochai 
Subject: PostgreSQL Asian language support for full text search using ICU (and 
also updating pg_trgm)
Date: Wed, 1 May 2019 08:55:50 +0700
Message-ID: 

>  Hello,
> 
> Currently PostgreSQL doesn't support full text search natively for many
> Asian languages such as Chinese, Japanese and others. These languages are
> used by a large portion of the population of the world.
> 
> The two key modules that could be modified to support Asian languages are
> the full text search module (including tsvector) and pg_trgm.
> 
> I would like to propose that this support be added to PostgreSQL.
> 
> For full text search, PostgreSQL could add a new parser (
> https://www.postgresql.org/docs/9.2/textsearch-parsers.html) that
> implements ICU word tokenization. This should be a lot easier than
> before now that PostgreSQL itself already includes ICU dependencies for
> other things.
> 
> Then allow the ICU parser to be chosen at run-time (via a run-time config
> or an option to to_tsvector). That is all that is needed to support full
> text search for many more Asian languages natively in PostgreSQL such as
> Chinese, Japanese and Thai.
> 
> For example Elastic Search implements this using its ICU Tokenizer plugin:
> https://www.elastic.co/guide/en/elasticsearch/guide/current/icu-tokenizer.html
> 
> Some information about the related APIs in ICU for this are at:
> http://userguide.icu-project.org/boundaryanalysis
> 
> Another simple improvement that would give another option for searching for
> Asian languages is to add a run-time setting for pg_trgm that would tell it
> to not drop non-ascii characters, as currently it only indexes ascii
> characters and thus all Asian language characters are dropped.
> 
> I emphasize 'run-time setting' because when using PostgreSQL via a
> Database-As-A-Service service provider, most of the time it is not possible
> to change the config files, recompile sources, or add any new extensions.
> 
> PostgreSQL is an awesome project and probably the best RDBMS right now. I
> hope the maintainers consider this suggestion.
> 
> Best Regards,
> Chanon




Re: using index or check in ALTER TABLE SET NOT NULL

2019-05-01 Thread David Rowley
On Thu, 2 May 2019 at 06:18, Sergei Kornilov  wrote:
>
> > I proposed that we didn't need those messages at all, which Robert pushed
> > back on, but I'd be willing to compromise by reducing them to NOTICE or
> > DEBUG level.
>
> I posted patch with such change in a separate topic: 
> https://commitfest.postgresql.org/23/2076/
>
> PS: not think this is blocker for v12

Probably not a blocker, but now does not seem like an unreasonable
time to lower these INFO messages down to DEBUG. If not, then we can
just remove the item from the open items list. I just put it there
because I didn't want it to be forgotten, since it wasn't going to be
anybody's priority to think hard about it in the final few days of the
last commitfest of the release.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-01 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> On 2019-05-01 10:06:03 -0700, Andres Freund wrote:
>>> I'm not sure this is the right short-term answer. Why isn't it, for now,
>>> sufficient to do what I suggested with RelationSetNewRelfilenode() not
>>> doing the CommandCounterIncrement(), and reindex_index() then doing the
>>> SetReindexProcessing() before a CommandCounterIncrement()? That's like
>>> ~10 line code change, and a few more with comments.

> That looks like a hack to me...

> The main thing I'm worried about right now is that I realized that
> our recovery from errors in this area is completely hosed, cf
> https://www.postgresql.org/message-id/4541.1556736...@sss.pgh.pa.us

OK, so per the other thread, it seems like the error recovery problem
isn't going to affect this directly.  However, I still don't like this
proposal much; the reason being that it's a rather fundamental change
in the API of RelationSetNewRelfilenode.  This will certainly break
any external callers of that function --- and silently, too.

Admittedly, there might not be any outside callers, but I don't really
like that assumption for something we're going to have to back-patch.

The solution I'm thinking of should have much more localized effects,
basically just in reindex_index and RelationSetNewRelfilenode, which is
why I like it better.

regards, tom lane




Re: [HACKERS] Commits 8de72b and 5457a1 (COPY FREEZE)

2019-05-01 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
>> I'm busy looking into the REINDEX-on-pg_class mess, and one thing I found
>> while testing HEAD is that with CLOBBER_CACHE_ALWAYS on, once you've
>> gotten a failure, your session is hosed:
>> 
>> regression=# select 1;
>> ?column? 
>> --
>> 1
>> (1 row)
>> 
>> regression=# reindex index pg_class_oid_index;
>> psql: ERROR:  could not read block 0 in file "base/16384/35344": read only 0 
>> of 8192 bytes
>> regression=# select 1;
>> psql: ERROR:  could not open file "base/16384/35344": No such file or 
>> directory

> Yea, I noticed that too. Led me astray for a while, because it
> triggered apparent REINDEX failures for pg_index too, even though they
> were not actually related to the reindex.

It seems that the reason we fail to recover is that we detect the error
during CommandEndInvalidationMessages, after we've updated the relcache
entry for pg_class_oid_index; of course at that point, any additional
pg_class access is going to fail because it'll try to use the
not-yet-existent index.  However, when we then abort the transaction,
we fail to revert pg_class_oid_index's relcache entry because *there is
not yet an invalidation queue entry telling us to do so*.

This is, therefore, an order-of-operations bug in
CommandEndInvalidationMessages.  It should attach the "current command"
queue entries to PriorCmdInvalidMsgs *before* it processes them, not
after.  In this way they'll be available to be reprocessed by 
AtEOXact_Inval or AtEOSubXact_Inval if we fail during CCI.

(A different way we could attack this is to make AtEOXact_Inval and
AtEOSubXact_Inval process "current" messages, as they do not today.
But I think that's a relatively inefficient solution, as it would
force those messages to be reprocessed even in the much-more-common
case where we abort before reaching CommandEndInvalidationMessages.)

The attached quick-hack patch changes that around, and seems to survive
testing (though I've not completed a CCA run with it).  The basic change
I had to make to make this work is to make AppendInvalidationMessageList
actually append the current-commands list to the prior-cmds list, not
prepend as it's really always done.  (In that way, we can still scan the
current-commands list just by starting at its head.)  The comments for
struct InvalidationChunk explicitly disclaim any significance to the order
of the entries, so this should be okay, and I'm not seeing any problem
in testing.  It's been a long time since I wrote that code, but I think
the reason I did it like that was the thought that in a long transaction,
the prior-cmds list might be much longer than the current-commands list,
so iterating down the prior-cmds list could add an O(N^2) cost.  The
easy answer to that, of course, is to make the lists doubly-linked,
and I'd do that before committing; this version is just for testing.
(It's short on comments, too.)

The thing I was worried about in RelationCacheInvalidate does seem
to be a red herring, at least fixing it is not necessary to make
the broken-session-state problem go away.

Comments?

regards, tom lane

diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index f09e3a9..8894bdf 100644
*** a/src/backend/utils/cache/inval.c
--- b/src/backend/utils/cache/inval.c
*** AddInvalidationMessage(InvalidationChunk
*** 264,287 
  }
  
  /*
!  * Append one list of invalidation message chunks to another, resetting
!  * the source chunk-list pointer to NULL.
   */
  static void
  AppendInvalidationMessageList(InvalidationChunk **destHdr,
  			  InvalidationChunk **srcHdr)
  {
! 	InvalidationChunk *chunk = *srcHdr;
  
! 	if (chunk == NULL)
  		return;	/* nothing to do */
  
! 	while (chunk->next != NULL)
! 		chunk = chunk->next;
  
! 	chunk->next = *destHdr;
  
! 	*destHdr = *srcHdr;
  
  	*srcHdr = NULL;
  }
--- 264,294 
  }
  
  /*
!  * Append the "source" list of invalidation message chunks to the "dest"
!  * list, resetting the source chunk-list pointer to NULL.
!  * Note that if the caller hangs onto the previous source pointer,
!  * the source list is still separately traversable afterwards.
   */
  static void
  AppendInvalidationMessageList(InvalidationChunk **destHdr,
  			  InvalidationChunk **srcHdr)
  {
! 	InvalidationChunk *chunk;
  
! 	if (*srcHdr == NULL)
  		return;	/* nothing to do */
  
! 	chunk = *destHdr;
  
! 	if (chunk == NULL)
! 		*destHdr = *srcHdr;
! 	else
! 	{
! 		while (chunk->next != NULL)
! 			chunk = chunk->next;
  
! 		chunk->next = *srcHdr;
! 	}
  
  	*srcHdr = NULL;
  }
*** xactGetCommittedInvalidationMessages(Sha
*** 858,867 
  	 */
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
  
- 	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
- 	 MakeSharedInvalidMessagesArray);
  	ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs,
  	 M

Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-01 Thread Alvaro Herrera
On 2019-May-01, Amit Kapila wrote:

> On Tue, Apr 30, 2019 at 7:52 PM Alvaro Herrera  
> wrote:

> > Hmm ... so, if vacuum runs and frees up any space from any of the pages,
> > then it should send out an invalidation -- it doesn't matter what the
> > FSM had, just that there is more free space now.  That means every other
> > process will need to determine a fresh FSM,
> 
> I think you intend to say the local space map because once FSM is
> created we will send invalidation and we won't further build relcache
> entry having local space map.

Yeah, I mean the map that records free space.

> > but that seems correct.  Sounds better than keeping outdated entries
> > indicating no-space-available.
> 
> Agreed, but as mentioned in one of the above emails, I am also bit
> scared that it should not lead to many invalidation messages for small
> relations, so may be we should send the invalidation message only when
> the entire page is empty.

I don't think that's a concern, is it?  You typically won't be running
multiple vacuums per second, or even multiple vacuums per minute.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: to_timestamp docs

2019-05-01 Thread Arthur Zakirov
On Thu, May 2, 2019 at 12:49 AM Alexander Korotkov
 wrote:
> > It works globally (but only for remaining string if you don't put it
> > at the beginning)
> > and you can set it only once. For example:
> >
> > =# SELECT to_timestamp('JUL   JUL JUL','FXMON_MON_MON');
> > ERROR:  invalid value "  J" for "MON"
> > DETAIL:  The given value did not match any of the allowed values for this 
> > field.
>
> Actually, FX takes effect on subsequent format patterns.  This is not
> documented, but it copycats Oracle behavior.  Sure, normally FX should
> be specified as the first item.  We could document current behavior or
> restrict specifying FX not as first item.  This is also not new in 12,
> so documenting current behavior is better for compatibility.

I went to Oracle's documentation. It seems that the behavior is
slightly different.
Their documentation says:

"A modifier can appear in a format model more than once. In such a case,
each subsequent occurrence toggles the effects of the modifier. Its effects are
enabled for the portion of the model following its first occurrence, and then
disabled for the portion following its second, and then reenabled for
the portion
following its third, and so on."

In PostgreSQL one cannot disable exact mode with a second FX. I don't
think we should add a restriction for FX. Instead, PostgreSQL's
documentation can be fixed. And the current explanation in the
documentation might be wrong, as Bruce pointed out.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




Re: to_timestamp docs

2019-05-01 Thread Bruce Momjian
On Thu, May  2, 2019 at 12:49:23AM +0300, Alexander Korotkov wrote:
> On Wed, May 1, 2019 at 11:20 PM Arthur Zakirov  
> wrote:
> > Hello,
> > Not sure if we need some additional checks here if FX is set.
> 
> I'd like to add that this behavior is not new in 12.  It was the same before.

Agreed, but since we are looking at it, let's document it.

> > > It seems DD and  (as numerics?) in FX mode eat trailing whitespace,
> > > while MON does not?  Also, I used these queries to determine it is
> > > "trailing" whitespace that "FXMON" controls:
> > >
> > > SELECT to_timestamp('JUL   JUL JUL','MON_FXMON_MON');
> > >   to_timestamp
> > > -
> > >  0001-07-01 00:00:00-04:56:02 BC
> > >
> > > SELECT to_timestamp('JUL JUL   JUL','MON_FXMON_MON');
> > > ERROR:  invalid value "  J" for "MON"
> > > DETAIL:  The given value did not match any of the allowed values 
> > > for this field.
> >
> > The problem here is that you need to specify FX only once and at the
> > beginning of
> > the format string. It is stated in the documentation:
> >
> > "FX must be specified as the first item in the template."
> >
> > It works globally (but only for the remaining string if you don't put it
> > at the beginning)
> > and you can set it only once. For example:
> >
> > =# SELECT to_timestamp('JUL   JUL JUL','FXMON_MON_MON');
> > ERROR:  invalid value "  J" for "MON"
> > DETAIL:  The given value did not match any of the allowed values for this 
> > field.
> 
> Actually, FX takes effect on subsequent format patterns.  This is not
> documented, but it copycats Oracle behavior.  Sure, normally FX should
> be specified as the first item.  We could document current behavior or
> restrict FX to being specified only as the first item.  This is also not new in 12,
> so documenting current behavior is better for compatibility.

Agreed.  Since it is pre-12 behavior, I suggest we just document it and
not change it.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: to_timestamp docs

2019-05-01 Thread Alexander Korotkov
On Wed, May 1, 2019 at 11:20 PM Arthur Zakirov  wrote:
> Hello,
>
> On Wed, May 1, 2019 at 6:05 PM Bruce Momjian  wrote:
> > Thanks.  I think I see the sentence you are thinking of:
> >
> >to_timestamp and to_date
> >skip multiple blank spaces at the beginning of the input string
> >and around date and time values unless the FX
> >option is used.
> >
> > However, first, it is unclear what 'skip' means here, i.e., does it mean
> > multiple blank spaces become a single space, or they are ignored.
>
> I worked on to_timestamp some time ago. In this case multiple blank spaces at
> the beginning should be ignored.
>
> > Second, I see inconsistent behaviour around the use of FX for various
> > patterns, e.g.:
> >
> > SELECT to_timestamp('5   1976','FXDD_FX');
> >   to_timestamp
> > ------------------------
> >  1976-01-05 00:00:00-05
>
> Hm, I think strspace_len() is partly to blame here, which is called by
> from_char_parse_int_len():
>
> /*
>  * Skip any whitespace before parsing the integer.
>  */
> *src += strspace_len(*src);
>
> But even if you remove this line of code then strtol() will eat the
> surviving whitespace:
>
> result = strtol(init, src, 10);
>
> Not sure if we need some additional checks here if FX is set.

I'd like to add that this behavior is not new in 12.  It was the same before.

> > It seems DD and  (as numerics?) in FX mode eat trailing whitespace,
> > while MON does not?  Also, I used these queries to determine it is
> > "trailing" whitespace that "FXMON" controls:
> >
> > SELECT to_timestamp('JUL   JUL JUL','MON_FXMON_MON');
> >   to_timestamp
> > -------------------------------
> >  0001-07-01 00:00:00-04:56:02 BC
> >
> > SELECT to_timestamp('JUL JUL   JUL','MON_FXMON_MON');
> > ERROR:  invalid value "  J" for "MON"
> > DETAIL:  The given value did not match any of the allowed values 
> > for this field.
>
> The problem here is that you need to specify FX only once and at the beginning of
> the format string. It is stated in the documentation:
>
> "FX must be specified as the first item in the template."
>
> It works globally (but only for the remaining string if you don't put it
> at the beginning)
> and you can set it only once. For example:
>
> =# SELECT to_timestamp('JUL   JUL JUL','FXMON_MON_MON');
> ERROR:  invalid value "  J" for "MON"
> DETAIL:  The given value did not match any of the allowed values for this 
> field.

Actually, FX takes effect on subsequent format patterns.  This is not
documented, but it copycats Oracle behavior.  Sure, normally FX should
be specified as the first item.  We could document current behavior or
restrict FX to being specified only as the first item.  This is also not new in 12,
so documenting current behavior is better for compatibility.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Commits 8de72b and 5457a1 (COPY FREEZE)

2019-05-01 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
>> There might be an argument for treating rd_createSubid the same way,
>> not sure.  That field wouldn't ever be set on a system catalog, so
>> it's less of a hazard, but there's something to be said for treating
>> the two fields alike.

> Seems sensible to treat it the same way. Not sure if there's an actual
> problem where the current treatment could cause a problem, but seems
> worth improving.

So if I try to treat rd_createSubid the same way, it passes regression
in normal mode, but CLOBBER_CACHE_ALWAYS is a total disaster: all
table creations fail, eg

regression=# CREATE TABLE BOOLTBL1 (f1 bool);
psql: ERROR:  relation 92773 deleted while still in use

What is happening is that once heap_create has created a relcache
entry for the new relation, any RelationCacheInvalidate call causes
us to try to reload that relcache entry, and that fails because there's
no pg_class or pg_attribute entries for the new relation yet.

This gets back to the thing we were poking at in the other thread,
which is whether or not a relcache entry is *always* reconstructible
from on-disk state.  I had in the back of my mind "um, what about
initial table creation?" but hadn't looked closely.  It seems that
the only reason that it works is that RelationCacheInvalidate is
unwilling to touch new-in-transaction relcache entries.

The ideal fix for this, perhaps, would be to rejigger things far enough
that we would not try to create a relcache entry for a new rel until
it was supported by some minimal set of catalog entries.  But that's
more change than I really want to undertake right now, and for sure
I wouldn't want to back-patch it.

Anyway, I believe the original justification for skipping
new-in-transaction entries here is that they couldn't possibly be
subjects of any cross-backend notification messages.  I think the argument
that that applies to rd_newRelfilenodeSubid is a whole lot weaker though.
But I'm going to let this go for the moment, because it's not clear whether
a patch like this is actually relevant to our immediate problem.  There
seem to be some other order-of-operations issues in invalidation
processing, and maybe fixing those would suffice.  More later.

regards, tom lane




Re: to_timestamp docs

2019-05-01 Thread Bruce Momjian
On Wed, May  1, 2019 at 11:20:05PM +0300, Arthur Zakirov wrote:
> Hello,
> 
> On Wed, May 1, 2019 at 6:05 PM Bruce Momjian  wrote:
> > Thanks.  I think I see the sentence you are thinking of:
> >
> >to_timestamp and to_date
> >skip multiple blank spaces at the beginning of the input string
> >and around date and time values unless the FX
> >option is used.
> >
> > However, first, it is unclear what 'skip' means here, i.e., does it mean
> > multiple blank spaces become a single space, or they are ignored.
> 
> I worked on to_timestamp some time ago. In this case multiple blank spaces at
> the beginning should be ignored.

OK.

> > Second, I see inconsistent behaviour around the use of FX for various
> > patterns, e.g.:
> >
> > SELECT to_timestamp('5   1976','FXDD_FX');
> >   to_timestamp
> > ------------------------
> >  1976-01-05 00:00:00-05
> 
> Hm, I think strspace_len() is partly to blame here, which is called by
> from_char_parse_int_len():
> 
> /*
>  * Skip any whitespace before parsing the integer.
>  */
> *src += strspace_len(*src);
> 
> But even if you remove this line of code then strtol() will eat the
> surviving whitespace:
> 
> result = strtol(init, src, 10);
> 
> Not sure if we need some additional checks here if FX is set.

Yes, I suspected it was part of the input function, but it seems it is
done in two places.  It seems we need the opposite of strspace_len() in
that place to throw an error if we are in FX mode.

> The problem here is that you need to specify FX only once and at the beginning of
> the format string. It is stated in the documentation:
> 
> "FX must be specified as the first item in the template."

Uh, FX certainly changes behavior if it isn't the first thing in the
format string.

> It works globally (but only for the remaining string if you don't put it
> at the beginning)

Uh, then the documentation is wrong?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: to_timestamp docs

2019-05-01 Thread Arthur Zakirov
Hello,

On Wed, May 1, 2019 at 6:05 PM Bruce Momjian  wrote:
> Thanks.  I think I see the sentence you are thinking of:
>
>to_timestamp and to_date
>skip multiple blank spaces at the beginning of the input string
>and around date and time values unless the FX
>option is used.
>
> However, first, it is unclear what 'skip' means here, i.e., does it mean
> multiple blank spaces become a single space, or they are ignored.

I worked on to_timestamp some time ago. In this case multiple blank spaces at
the beginning should be ignored.

> Second, I see inconsistent behaviour around the use of FX for various
> patterns, e.g.:
>
> SELECT to_timestamp('5   1976','FXDD_FX');
>   to_timestamp
> ------------------------
>  1976-01-05 00:00:00-05

Hm, I think strspace_len() is partly to blame here, which is called by
from_char_parse_int_len():

/*
 * Skip any whitespace before parsing the integer.
 */
*src += strspace_len(*src);

But even if you remove this line of code then strtol() will eat the
surviving whitespace:

result = strtol(init, src, 10);

Not sure if we need some additional checks here if FX is set.

> It seems DD and  (as numerics?) in FX mode eat trailing whitespace,
> while MON does not?  Also, I used these queries to determine it is
> "trailing" whitespace that "FXMON" controls:
>
> SELECT to_timestamp('JUL   JUL JUL','MON_FXMON_MON');
>   to_timestamp
> -
>  0001-07-01 00:00:00-04:56:02 BC
>
> SELECT to_timestamp('JUL JUL   JUL','MON_FXMON_MON');
> ERROR:  invalid value "  J" for "MON"
> DETAIL:  The given value did not match any of the allowed values for 
> this field.

The problem here is that you need to specify FX only once and at the beginning of
the format string. It is stated in the documentation:

"FX must be specified as the first item in the template."

It works globally (but only for the remaining string if you don't put it
at the beginning)
and you can set it only once. For example:

=# SELECT to_timestamp('JUL   JUL JUL','FXMON_MON_MON');
ERROR:  invalid value "  J" for "MON"
DETAIL:  The given value did not match any of the allowed values for this field.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company




pg_upgrade --clone error checking

2019-05-01 Thread Jeff Janes
With the new pg_upgrade --clone, if we are going to end up throwing the
error "file cloning not supported on this platform" (which seems to depend
only on ifdefs) I think we should throw it first thing, before any other
checks are done and certainly before pg_dump gets run.

This might result in some small amount of code duplication, but I think it
would be worth the cost.

For cases where we might throw "could not clone file between old and new
data directories", I wonder if we shouldn't do some kind of dummy copy to
catch that error earlier, as well.  Maybe that one is not worth it.

Cheers,

Jeff


Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-01 Thread Andres Freund
Hi,

On 2019-05-01 10:21:15 -0700, Andres Freund wrote:
> FWIW, the dirty-hack version (attached) of the CommandCounterIncrement()
> approach fixes the issue for a REINDEX of pg_class_oid_index in isolation,
> even when using CCA. Started a whole CCA testrun with it, but the
> results of that will obviously not be in quickly.

Not finished yet, but it got pretty far:

parallel group (5 tests):  create_index_spgist index_including_gist 
index_including create_view create_index
 create_index ... ok   500586 ms
 create_index_spgist  ... ok    86890 ms
 create_view  ... ok   466512 ms
 index_including  ... ok   150279 ms
 index_including_gist ... ok   109087 ms
test reindex_catalog  ... ok 2285 ms
parallel group (16 tests):  create_cast roleattributes drop_if_exists 
create_aggregate vacuum create_am hash_func select create_function_3 
constraints typed_table rolenames errors updatable_views triggers inherit

that's where it's at right now:

parallel group (20 tests):  init_privs security_label gin password 
drop_operator lock gist tablesample spgist


Greetings,

Andres Freund




Re: [HACKERS] Commits 8de72b and 5457a1 (COPY FREEZE)

2019-05-01 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
>> This seems quite wrong, because it prevents us from rebuilding the
>> entry with corrected values.  In particular notice that the change
>> causes us to skip the RelationInitPhysicalAddr call that would
>> normally be applied to a nailed mapped index in that loop.  That's
>> completely fatal in this case --- it keeps us from restoring the
>> correct relfilenode that the mapper would now tell us, if we only
>> bothered to ask.

> Indeed. I'm a bit surprised that doesn't lead to more problems.

> Not sure I understand where the RelationCacheInvalidate() call is coming
> from in this case though. Shouldn't the entry have been invalidated
> individually through ATEOXact_Inval(false)?

In CLOBBER_CACHE_ALWAYS mode it's likely that we get a
RelationCacheInvalidate call first.

Note that the change I'm talking about here is not sufficient to fix
the failure; there are more problems behind it.  I just wanted to know
if there was something I was missing about that old patch.

regards, tom lane




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-01 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-01 10:06:03 -0700, Andres Freund wrote:
>> I'm not sure this is the right short-term answer. Why isn't it, for now,
>> sufficient to do what I suggested with RelationSetNewRelfilenode() not
>> doing the CommandCounterIncrement(), and reindex_index() then doing the
>> SetReindexProcessing() before a CommandCounterIncrement()? That's like
>> ~10 line code change, and a few more with comments.

That looks like a hack to me...

The main thing I'm worried about right now is that I realized that
our recovery from errors in this area is completely hosed, cf
https://www.postgresql.org/message-id/4541.1556736...@sss.pgh.pa.us

The problem with CCA is actually kind of convenient for testing that,
since it means you don't have to inject any new fault to get an error
to be thrown while the index relcache entry is in the needing-to-be-
reverted state.  So I'm going to work on fixing the recovery first.
But I suspect that doing this right will require the more complicated
approach anyway.

regards, tom lane




Re: [HACKERS] Commits 8de72b and 5457a1 (COPY FREEZE)

2019-05-01 Thread Andres Freund
Hi,

On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
> I'm busy looking into the REINDEX-on-pg_class mess, and one thing I found
> while testing HEAD is that with CLOBBER_CACHE_ALWAYS on, once you've
> gotten a failure, your session is hosed:
> 
> regression=# select 1;
>  ?column? 
> --
> 1
> (1 row)
> 
> regression=# reindex index pg_class_oid_index;
> psql: ERROR:  could not read block 0 in file "base/16384/35344": read only 0 
> of 8192 bytes
> regression=# select 1;
> psql: ERROR:  could not open file "base/16384/35344": No such file or 
> directory

Yea, I noticed that too. Led me astray for a while, because it
triggered apparent REINDEX failures for pg_index too, even though not
actually related to the reindex.


> The problem is that the nailed relcache entry for pg_class_oid_index got
> updated to point to the new relfilenode, and that change did not get
> undone by transaction abort.  There seem to be several bugs contributing
> to that, but the one I'm asking about right now is that commit ae9aba69a
> caused RelationCacheInvalidate to skip over relations that have
> rd_newRelfilenodeSubid set, as indeed pg_class_oid_index will have at the
> instant of the failure.

That commit message is quite unhelpful about what exactly this was
intended to fix. I assume it was intended to make COPY FREEZE usage
predictable, but...


> This seems quite wrong, because it prevents us from rebuilding the
> entry with corrected values.  In particular notice that the change
> causes us to skip the RelationInitPhysicalAddr call that would
> normally be applied to a nailed mapped index in that loop.  That's
> completely fatal in this case --- it keeps us from restoring the
> correct relfilenode that the mapper would now tell us, if we only
> bothered to ask.

Indeed. I'm a bit surprised that doesn't lead to more problems.

Not sure I understand where the RelationCacheInvalidate() call is coming
from in this case though. Shouldn't the entry have been invalidated
individually through ATEOXact_Inval(false)?


> I think perhaps what we should do is revert ae9aba69a and instead do
> 
>   relcacheInvalsReceived++;
> 
> - if (RelationHasReferenceCountZero(relation))
> + if (RelationHasReferenceCountZero(relation) &&
> + relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
>   {
>   /* Delete this entry immediately */
>   Assert(!relation->rd_isnailed);
> 
> so that entries with rd_newRelfilenodeSubid nonzero are preserved but are
> rebuilt.

Seems like a reasonable approach.


> There might be an argument for treating rd_createSubid the same way,
> not sure.  That field wouldn't ever be set on a system catalog, so
> it's less of a hazard, but there's something to be said for treating
> the two fields alike.

Seems sensible to treat it the same way. Not sure if there's an actual
problem where the current treatment could cause a problem, but seems
worth improving.

Greetings,

Andres Freund




Re: [HACKERS] Commits 8de72b and 5457a1 (COPY FREEZE)

2019-05-01 Thread Tom Lane
[ blast-from-the-past department ]

Simon Riggs  writes:
> On 11 December 2012 03:01, Noah Misch  wrote:
>> Until these threads, I did not know that a relcache invalidation could trip 
>> up
>> the WAL avoidance optimization, and now this.  I poked at the relevant
>> relcache.c code, and it already takes pains to preserve the needed facts.  
>> The
>> header comment of RelationCacheInvalidate() indicates that entries bearing an
>> rd_newRelfilenodeSubid can safely survive the invalidation, but the code does
>> not implement that.  I think the comment is right, and this is just an
>> oversight in the code going back to its beginning (fba8113c).

> I think the comment is right also and so the patch is good. I will
> apply, barring objections.

I'm busy looking into the REINDEX-on-pg_class mess, and one thing I found
while testing HEAD is that with CLOBBER_CACHE_ALWAYS on, once you've
gotten a failure, your session is hosed:

regression=# select 1;
 ?column? 
--
1
(1 row)

regression=# reindex index pg_class_oid_index;
psql: ERROR:  could not read block 0 in file "base/16384/35344": read only 0 of 
8192 bytes
regression=# select 1;
psql: ERROR:  could not open file "base/16384/35344": No such file or directory


The problem is that the nailed relcache entry for pg_class_oid_index got
updated to point to the new relfilenode, and that change did not get
undone by transaction abort.  There seem to be several bugs contributing
to that, but the one I'm asking about right now is that commit ae9aba69a
caused RelationCacheInvalidate to skip over relations that have
rd_newRelfilenodeSubid set, as indeed pg_class_oid_index will have at the
instant of the failure.  This seems quite wrong, because it prevents us
from rebuilding the entry with corrected values.  In particular notice
that the change causes us to skip the RelationInitPhysicalAddr call that
would normally be applied to a nailed mapped index in that loop.  That's
completely fatal in this case --- it keeps us from restoring the correct
relfilenode that the mapper would now tell us, if we only bothered to ask.

I think perhaps what we should do is revert ae9aba69a and instead do

relcacheInvalsReceived++;

-   if (RelationHasReferenceCountZero(relation))
+   if (RelationHasReferenceCountZero(relation) &&
+   relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
{
/* Delete this entry immediately */
Assert(!relation->rd_isnailed);

so that entries with rd_newRelfilenodeSubid nonzero are preserved but are
rebuilt.

There might be an argument for treating rd_createSubid the same way,
not sure.  That field wouldn't ever be set on a system catalog, so
it's less of a hazard, but there's something to be said for treating
the two fields alike.

Thoughts?

regards, tom lane




Re: Adding a test for speculative insert abort case

2019-05-01 Thread Andres Freund
Hi,

On 2019-04-30 18:34:42 -0700, Melanie Plageman wrote:
> On Tue, Apr 30, 2019 at 5:22 PM Andres Freund  wrote:
>
> >
> > Not easily so - that's why the ON CONFLICT patch didn't add code
> > coverage for it :(. I wonder if you could whip something up by having
> > another non-unique expression index, where the expression acquires a
> > advisory lock? If that advisory lock where previously acquired by
> > another session, that should allow to write a reliable isolation test?
> >
> >
> So, I took a look at one of the existing tests that does something like what
> you mentioned and tried the following:
> --
> create table t1(key int, val text);
> create unique index t1_uniq_idx on t1(key);
> create or replace function t1_lock_func(int) returns int immutable language
> sql AS
> 'select pg_advisory_xact_lock_shared(1); select $1';
> create index t1_lock_idx ON t1(t1_lock_func(key));
> --
> s1:
> begin isolation level read committed;
> insert into t1 values(1, 'someval');
> s2:
> set default_transaction_isolation = 'read committed';
> insert into t1 values(1, 'anyval') on conflict(key) do update set val =
> 'updatedval';
> --
>
> So, the above doesn't work because s2 waits to acquire the lock in the first
> phase of the speculative insert -- when it is just checking the index,
> before
> inserting to the table and before inserting to the index.

Couldn't that be addressed by having t1_lock_func() acquire two locks?
One for blocking during the initial index probe, and one for the
speculative insertion?

I'm imagining something like

if (pg_try_advisory_xact_lock(1))
pg_advisory_xact_lock(2);
else
pg_advisory_xact_lock(1);

in t1_lock_func. If you then make the session something roughly like

s1: pg_advisory_xact_lock(1);
s1: pg_advisory_xact_lock(2);

s2: upsert t1 
s1: pg_advisory_xact_unlock(1);
s2: 
s2: 
s1: insert into t1 values(1, 'someval');
s1: pg_advisory_xact_unlock(2);
s2: 
s2: spec-conflict

Greetings,

Andres Freund




Re: Adding a test for speculative insert abort case

2019-05-01 Thread Melanie Plageman
On Tue, Apr 30, 2019 at 7:14 PM Thomas Munro  wrote:

> I think it'd be nice to have a set of macros that can create wait
> points in the C code that isolation tests can control, in a special
> build.  Perhaps there could be shm hash table of named wait points in
> shared memory; if DEBUG_WAIT_POINT("foo") finds that "foo" is not
> present, it continues, but if it finds an entry it waits for it to go
> away.  Then isolation tests could add/remove names and signal a
> condition variable to release waiters.
>
> I contemplated that while working on SKIP LOCKED, which had a bunch of
> weird edge cases that I tested by inserting throw-away wait-point code
> like this:
>
>
> https://www.postgresql.org/message-id/CADLWmXXss83oiYD0pn_SfQfg%2ByNEpPbPvgDb8w6Fh--jScSybA%40mail.gmail.com
>
> Yes, I agree it would be nice to have a framework like this.

Greenplum actually has a fault injection framework that, I believe, works
similarly to what you are describing -- i.e. sets a variable in shared
memory.
There is an extension, gp_inject_fault, which allows you to set the faults.
Andreas Scherbaum wrote a blog post about how to use it [1].

The Greenplum implementation is not documented particularly well in the
code, but it is something that folks working on Greenplum have talked
about modifying and proposing to Postgres.

[1] 
http://engineering.pivotal.io/post/testing_greenplum_database_using_fault_injection/


-- 
Melanie Plageman


Re: using index or check in ALTER TABLE SET NOT NULL

2019-05-01 Thread Sergei Kornilov
Hi

> I proposed that we didn't need those messages at all, which Robert pushed
> back on, but I'd be willing to compromise by reducing them to NOTICE or
> DEBUG level.

I posted patch with such change in a separate topic: 
https://commitfest.postgresql.org/23/2076/

PS: I don't think this is a blocker for v12

regards, Sergei




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-01 Thread Andres Freund
On 2019-05-01 10:06:03 -0700, Andres Freund wrote:
> I'm not sure this is the right short-term answer. Why isn't it, for now,
> sufficient to do what I suggested with RelationSetNewRelfilenode() not
> doing the CommandCounterIncrement(), and reindex_index() then doing the
> SetReindexProcessing() before a CommandCounterIncrement()? That's like
> ~10 line code change, and a few more with comments.
> 
> There is the danger that the current and above approach basically relies
> on there not to be any non-inplace updates during reindex.  But at the
> moment code does take care to use inplace updates
> (cf. index_update_stats()).
> 
> It's not clear to me whether the approach of using
> RelationSetIndexList() in reindex_index() would be meaningfully more
> robust against non-inplace updates during reindex either - ISTM we'd
> just as well skip the necessary index insertions if we hid the index
> being rebuilt. Skipping to-be-rebuilt indexes works for
> reindex_relation() because they're going to be rebuilt subsequently (and
> thus the missing index rows don't matter) - but it'd not work for
> reindexing a single index, because it'll not get the result at a later
> stage.

FWIW, the dirty-hack version (attached) of the CommandCounterIncrement()
approach fixes the issue for a REINDEX of pg_class_oid_index in isolation,
even when using CCA. Started a whole CCA testrun with it, but the
results of that will obviously not be in quickly.

Greetings,

Andres Freund
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index a05b6a07ad0..6698262e202 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6108,6 +6108,7 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
 
 	/* Process xmin */
 	xid = HeapTupleHeaderGetXmin(tuple);
+	if (xid == FrozenTransactionId)
 	xmin_frozen = ((xid == FrozenTransactionId) ||
    HeapTupleHeaderXminFrozen(tuple));
 	if (TransactionIdIsNormal(xid))
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index ce024101fc9..98f26e13627 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -3344,6 +3344,8 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 		/* Suppress use of the target index while rebuilding it */
 		SetReindexProcessing(heapId, indexId);
 
+		CommandCounterIncrement();
+
 		/* Initialize the index and rebuild */
 		/* Note: we do not need to re-establish pkey setting */
 		index_build(heapRelation, iRel, indexInfo, true, true);
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index e9add1b9873..f0a4bac75f6 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -315,6 +315,7 @@ ResetSequence(Oid seq_relid)
 	 * Create a new storage file for the sequence.
 	 */
 	RelationSetNewRelfilenode(seq_rel, seq_rel->rd_rel->relpersistence);
+	CommandCounterIncrement();
 
 	/*
 	 * Ensure sequence's relfrozenxid is at 0, since it won't contain any
@@ -490,6 +491,7 @@ AlterSequence(ParseState *pstate, AlterSeqStmt *stmt)
 		 * changes transactional.
 		 */
 		RelationSetNewRelfilenode(seqrel, seqrel->rd_rel->relpersistence);
+		CommandCounterIncrement();
 
 		/*
 		 * Ensure sequence's relfrozenxid is at 0, since it won't contain any
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index f6c9cf3b0ec..eef922d8718 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -1791,6 +1791,8 @@ ExecuteTruncateGuts(List *explicit_rels, List *relids, List *relids_logged,
 table_close(toastrel, NoLock);
 			}
 
+			CommandCounterIncrement();
+
 			/*
 			 * Reconstruct the indexes to match, and we're done.
 			 */
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 90ff8ccf54f..2bdfb2d5a9c 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -3541,7 +3541,7 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
 	 * Make the pg_class row change visible, as well as the relation map
 	 * change if any.  This will cause the relcache entry to get updated, too.
 	 */
-	CommandCounterIncrement();
+	//CommandCounterIncrement();
 
 	/*
 	 * Mark the rel as having been given a new relfilenode in the current


Re: walsender vs. XLogBackgroundFlush during shutdown

2019-05-01 Thread Alexander Kukushkin
Hi,

On Wed, 1 May 2019 at 17:02, Tomas Vondra  wrote:

> OK, so that seems like a separate issue, somewhat unrelated to the issue I
> reported. And I'm not sure it's a walsender issue - it seems it might be a
> psycopg2 issue in not reporting the flush properly, no?

Agree, it is a different issue, but I am unsure what to blame,
postgres or psycopg2.
Right now in psycopg2 we confirm more or less every XLogData
message, but at the same time the LSN on the primary is moving forward
and we get updates via keepalive messages.
I perfectly understand the need to periodically send updates of flush
= walEnd (which comes from keepalive). It might happen that there is
no transaction activity but WAL is still generated, and as a result
the replication slot will prevent WAL from being cleaned up.
From the client side perspective, it confirmed everything that it
should, but from the postgres side, this is not enough to shut down
cleanly. Maybe it is possible to change the check (sentPtr ==
replicatedPtr) to something like (lastMsgSentPtr <= replicatedPtr), or
would that be unsafe?

> >Actually, the same problem will happen even in the case when the
> >consumer script receives some message, but not very intensively, but
> >it is just a bit harder to reproduce it.
> >
> >If you attach to the walsender with gdb, you'll see the following picture:
> >(gdb) bt
> >#0  0x7fd6623d296a in __libc_send (fd=8, buf=0x55cb958dca08,
> >len=94, flags=0) at ../sysdeps/unix/sysv/linux/send.c:28
> >#1  0x55cb93aa7ce9 in secure_raw_write (port=0x55cb958d71e0,
> >ptr=0x55cb958dca08, len=94) at be-secure.c:318
> >#2  0x55cb93aa7b87 in secure_write (port=0x55cb958d71e0,
> >ptr=0x55cb958dca08, len=94) at be-secure.c:265
> >#3  0x55cb93ab6bf9 in internal_flush () at pqcomm.c:1433
> >#4  0x55cb93ab6b89 in socket_flush () at pqcomm.c:1409
> >#5  0x55cb93dac30b in send_message_to_frontend
> >(edata=0x55cb942b4380 ) at elog.c:3317
> >#6  0x55cb93da8973 in EmitErrorReport () at elog.c:1481
> >#7  0x55cb93da5abf in errfinish (dummy=0) at elog.c:481
> >#8  0x55cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3
> >"sending replication keepalive") at elog.c:1376
> >#9  0x55cb93bcae71 in WalSndKeepalive (requestReply=true) at
> >walsender.c:3358
>
> Is it stuck in the send() call forever, or did you happen to grab
> this backtrace?

No, it didn't get stuck there. During shutdown postgres starts sending
a few thousand keepalive messages per second and receives just as many
feedback messages back, so the chances of catching it somewhere inside
send() are quite high.
The loop never terminates because psycopg2 always replies with the
same flush position as last time, set while processing the most recent
XLogData message.

> I think having a report of an issue, with a way to reproduce it is a first
> (and quite important) step. So thanks for doing that.
>
> That being said, I think those are two separate issues, with different
> causes and likely different fixes. I don't think fixing the xlog flush
> will resolve your issue, and vice versa.

Agree, these are different issues.

Regards,
--
Alexander Kukushkin




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-05-01 Thread Robert Haas
On Wed, May 1, 2019 at 12:50 PM Tom Lane  wrote:
> > Not strongly enough to argue about it very hard.  The current behavior
> > is a little weird, but it's a long way from being the weirdest thing
> > we ship, and it appears that we have no tangible evidence that it
> > causes a problem in practice.
>
> I think there's nothing that fails to suck about a hardwired "+ 10".

It avoids a performance regression without adding another GUC.

That may not be enough reason to keep it like that, but it is one
thing that does fail to suck.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: hyrax vs. RelationBuildPartitionDesc

2019-05-01 Thread Robert Haas
On Wed, May 1, 2019 at 11:59 AM Andres Freund  wrote:
> The message I'm replying to is marked as an open item. Robert, what do
> you think needs to be done here before release?  Could you summarize,
> so we then can see what others think?

The original issue on this thread was that hyrax started running out
of memory when it hadn't been doing so previously.  That happened
because, for complicated reasons, commit
898e5e3290a72d288923260143930fb32036c00c resulted in the leak being
hit lots of times in CLOBBER_CACHE_ALWAYS builds instead of just once.
Commits 2455ab48844c90419714e27eafd235a85de23232 and
d3f48dfae42f9655425d1f58f396e495c7fb7812 fixed that problem.

In the email at issue, Tom complains about two things.  First, he
complains about the fact that I try to arrange things so that relcache
data remains valid for as long as required instead of just copying it.
Second, he complains about the fact repeated ATTACH and DETACH
PARTITION operations can produce a slow session-lifespan memory leak.

I think it's reasonable to fix the second issue for v12.  I am not
sure how important it is, because (1) the leak is small, (2) it seems
unlikely that anyone would execute enough ATTACH/DETACH PARTITION
operations in one backend for the leak to amount to anything
significant, and (3) if a relcache flush ever happens when you don't
have the relation open, all of the "leaked" memory will be un-leaked.
However, I believe that we could fix it as follows.  First, invent
rd_pdoldcxt and put the first old context there; if that pointer is
already in use, then parent the new context under the old one.
Second, in RelationDecrementReferenceCount, if the refcount hits 0,
nuke rd_pdoldcxt and set the pointer back to NULL.  With that change,
you would only keep the old PartitionDesc around until the ref count
hits 0, whereas at present, you keep the old PartitionDesc around
until an invalidation happens while the ref count is 0.

I think the first issue is not v12 material.  Tom proposed to fix it
by copying all the relevant data out of the relcache, but his own
performance results show a pretty significant hit, and AFAICS he
hasn't pointed to anything that's actually broken by the current
approach.  What I think should be done is actually generalize the
approach I took in this patch, so that instead of the partition
directory holding a reference count, the planner or executor holds
one.  Then not only could people who want information about the
PartitionDesc avoid copying stuff from the relcache, but maybe other
things as well.  I think that would be significantly better than
continuing to double-down on the current copy-everything approach,
which really only works well in a world where a table can't have all
that much metadata, which is clearly not true for PostgreSQL any more.
I'm not sure that Tom is actually opposed to this approach -- although
I may have misunderstood his position -- but where we disagree, I
think, is that I see what I did in this commit as a stepping-stone
toward a better world, and he sees it as something that should be
killed with fire until that better world has fully arrived.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-01 Thread Andres Freund
Hi,

On 2019-05-01 12:20:22 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > I see the bug. Turns out we need to figure out another way to solve the
> > assertion triggered by doing catalog updates within
> > RelationSetNewRelfilenode() - we can't just move the
> > SetReindexProcessing() before it.  When CCA is enabled, the
> > CommandCounterIncrement() near the tail of RelationSetNewRelfilenode()
> > triggers a rebuild of the catalog entries - but without the
> > SetReindexProcessing() those scans will try to use the index currently
> > being rebuilt.
> 
> Yeah.  I think what this demonstrates is that REINDEX INDEX has to have
> RelationSetIndexList logic similar to what REINDEX TABLE has, to control
> which indexes get updated when while we're rebuilding an index of
> pg_class.  In hindsight that seems glaringly obvious ... I wonder how we
> missed that when we built that infrastructure for REINDEX TABLE?

I'm not sure this is the right short-term answer. Why isn't it, for now,
sufficient to do what I suggested with RelationSetNewRelfilenode() not
doing the CommandCounterIncrement(), and reindex_index() then doing the
SetReindexProcessing() before a CommandCounterIncrement()? That's like
~10 line code change, and a few more with comments.

There is the danger that both the current approach and the one above
basically rely on there not being any non-inplace updates during
reindex.  But at the moment the code does take care to use inplace
updates (cf. index_update_stats()).

It's not clear to me whether the approach of using
RelationSetIndexList() in reindex_index() would be meaningfully more
robust against non-inplace updates during reindex either - ISTM we'd
just as well skip the necessary index insertions if we hid the index
being rebuilt. Skipping to-be-rebuilt indexes works for
reindex_relation() because they're going to be rebuilt subsequently (and
thus the missing index rows don't matter) - but it'd not work for
reindexing a single index, because it'll not get the result at a later
stage.


> I'm pretty sure that infrastructure is my fault, so I'll take a
> whack at fixing this.
> 
> Did you figure out why this doesn't also happen in HEAD?

It does for me now, at least when just doing a reindex in isolation (CCA
tests would have taken too long last night). I'm not sure why I wasn't
previously able to trigger it and markhor hasn't run yet on master.

Greetings,

Andres Freund




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-01 Thread Tom Lane
I wrote:
> Did you figure out why this doesn't also happen in HEAD?

... actually, HEAD *is* broken with CCA, just differently.
I'm on it.

regards, tom lane




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-05-01 Thread Tom Lane
Robert Haas  writes:
> On Wed, May 1, 2019 at 12:15 PM Andres Freund  wrote:
>> My current inclination is to not do anything for v12. Robert, do you
>> disagree?

> Not strongly enough to argue about it very hard.  The current behavior
> is a little weird, but it's a long way from being the weirdest thing
> we ship, and it appears that we have no tangible evidence that it
> causes a problem in practice.

I think there's nothing that fails to suck about a hardwired "+ 10".

We should either remove that and use effective_io_concurrency as-is,
or decide that it's worth having a separate GUC for bulk operations.
At this stage of the cycle I'd incline to the former, but if somebody
is excited enough to prepare a patch for a new GUC, I wouldn't push
back on that solution.

regards, tom lane




Re: walsender vs. XLogBackgroundFlush during shutdown

2019-05-01 Thread Tomas Vondra

On Wed, May 01, 2019 at 08:53:15AM -0700, Andres Freund wrote:

Hi,

On 2019-05-01 02:28:45 +0200, Tomas Vondra wrote:

The reason for that is pretty simple - the walsenders are doing logical
decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
But per XLogSendLogical() this only happens if

   if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
   {
   WalSndCaughtUp = true;
   ...
   }

That is, we need to get beyond GetFlushRecPtr(). But that may never
happen, because there may be an incomplete (and unflushed) page in WAL
buffers, not forced out by any transaction. So if there's a WAL record
overflowing to the unflushed page, the walsender will never catch up.

Now, this situation is apparently expected, because WalSndWaitForWal()
does this:

   /*
* If we're shutting down, trigger pending WAL to be written out,
* otherwise we'd possibly end up waiting for WAL that never gets
* written, because walwriter has shut down already.
*/
   if (got_STOPPING)
   XLogBackgroundFlush();

but unfortunately that does not actually do anything, because right at
the very beginning XLogBackgroundFlush() does this:

   /* back off to last completed page boundary */
   WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;



That is, it intentionally ignores the incomplete page, which means the
walsender can't read the record and reach GetFlushRecPtr().

XLogBackgroundFlush() has done this since (at least) 2007, so how come we
never had issues with this before?


I assume that's because of the following logic:
/* if we have already flushed that far, consider async commit records */
if (WriteRqst.Write <= LogwrtResult.Flush)
{
SpinLockAcquire(&XLogCtl->info_lck);
WriteRqst.Write = XLogCtl->asyncXactLSN;
SpinLockRelease(&XLogCtl->info_lck);
flexible = false;   /* ensure it all gets written */
}

and various pieces of the code doing XLogSetAsyncXactLSN() to force
flushing.  I wonder if the issue is that we're better at avoiding
unnecessary WAL to be written due to
6ef2eba3f57f17960b7cd4958e18aa79e357de2f



I don't think so, because (a) there are no async commits involved, and (b)
we originally ran into the issue on 9.6 and 6ef2eba3f57f1 is only in 10+.




But I don't think we're safe without the failover slots patch, because
any output plugin can do something similar - say, LogLogicalMessage() or
something like that. I'm not aware of a plugin doing such things, but I
don't think it's illegal / prohibited either. (Of course, plugins that
generate WAL won't be useful for decoding on standby in the future.)


FWIW, I'd consider such an output plugin outright broken.



Why? Is that prohibited somewhere, either explicitly or implicitly? I do
see obvious issues with generating WAL from a plugin (infinite cycles and
so on), but I suppose those are regular coding issues rather than
something that would make every plugin doing that broken.

FWIW I don't see any particular need to generate WAL from output plugins,
I mentioned it mostly just as a convenient way to trigger the issue. I
suppose there are other ways to generate a bit of WAL during shutdown.




So what I think we should do is to tweak XLogBackgroundFlush() so that
during shutdown it skips the back-off to page boundary, and flushes even
the last piece of WAL. There are only two callers anyway, so something
like XLogBackgroundFlush(bool) would be simple enough.


I think it then just ought to be a normal XLogFlush(). I.e. something
along the lines of:

	/*
	 * If we're shutting down, trigger pending WAL to be written out,
	 * otherwise we'd possibly end up waiting for WAL that never gets
	 * written, because walwriter has shut down already.
	 */
	if (got_STOPPING && !RecoveryInProgress())
		XLogFlush(GetXLogInsertRecPtr());

	/* Update our idea of the currently flushed position. */
	if (!RecoveryInProgress())
		RecentFlushPtr = GetFlushRecPtr();
	else
		RecentFlushPtr = GetXLogReplayRecPtr(NULL);



Perhaps. That would work too, I guess.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: New vacuum option to do only freezing

2019-05-01 Thread Robert Haas
On Wed, May 1, 2019 at 12:14 PM Andres Freund  wrote:
> On 2019-04-06 16:13:53 +0900, Michael Paquier wrote:
> > On Sat, Apr 06, 2019 at 11:31:31AM +0900, Masahiko Sawada wrote:
> > > Yes, but Fujii-san pointed out that this option doesn't support toast
> > > tables, and I think there is no specific reason not to support
> > > them. So it might be good to add toast.vacuum_index_cleanup. Attached
> > > patch.
> >
> > Being able to control that option at toast level sounds sensible.  I
> > have added an open item about that.
>
> Robert, what is your stance on this open item? It's been an open item
> for about three weeks, without any progress.

The actual bug in this patch needs to be fixed, but I see we have
another open item for that.  This open item, as I understand it, is
all about whether we should add another reloption so that you can
control this behavior separately for TOAST tables.  In my opinion,
that's not a critical change and the open item should be dropped, but
others might see it differently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: using index or check in ALTER TABLE SET NOT NULL

2019-05-01 Thread Tom Lane
Andres Freund  writes:
> There's still an open item "Debate INFO messages in ATTACH PARTITION and
> SET NOT NULL" for this thread.  Is there anything further to be done? Or
> are we content with this for v12?

IIRC, that's based on my complaint that these messages have no business
being INFO level.  I'm definitely not content with the current state.

I proposed that we didn't need those messages at all, which Robert pushed
back on, but I'd be willing to compromise by reducing them to NOTICE or
DEBUG level.

The actually intended use of INFO level is to mark messages that the user
actively asked for (via VERBOSE or some morally-equivalent option), which
is why that level has such high priority.  If we had something like a
VERBOSE option for ALTER TABLE, it'd make plenty of sense to follow
VACUUM's precedent:

if (params->options & VACOPT_VERBOSE)
elevel = INFO;
else
elevel = DEBUG2;

It seems a bit late to be inventing such a thing for v12 though.

In the meantime, I'm not very content with random subsets of
ALTER TABLE acting as if they have been granted permission
to be VERBOSE.  It's unfriendly, and it's inconsistent because
there are so many other expensive ALTER TABLE actions that don't
do this.

regards, tom lane




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-05-01 Thread Robert Haas
On Wed, May 1, 2019 at 12:15 PM Andres Freund  wrote:
> On 2019-04-01 18:26:59 -0700, Andres Freund wrote:
> > I'm not yet convinced it's necessary to create a new GUC, but also not
> > strongly opposed.  I've created an open items issue for it, so we don't
> > forget.
>
> My current inclination is to not do anything for v12. Robert, do you
> disagree?

Not strongly enough to argue about it very hard.  The current behavior
is a little weird, but it's a long way from being the weirdest thing
we ship, and it appears that we have no tangible evidence that it
causes a problem in practice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-05-01 Thread Tom Lane
Andres Freund  writes:
> I see the bug. Turns out we need to figure out another way to solve the
> assertion triggered by doing catalog updates within
> RelationSetNewRelfilenode() - we can't just move the
> SetReindexProcessing() before it.  When CCA is enabled, the
> CommandCounterIncrement() near the tail of RelationSetNewRelfilenode()
> triggers a rebuild of the catalog entries - but without the
> SetReindexProcessing() those scans will try to use the index currently
> being rebuilt.

Yeah.  I think what this demonstrates is that REINDEX INDEX has to have
RelationSetIndexList logic similar to what REINDEX TABLE has, to control
which indexes get updated when while we're rebuilding an index of
pg_class.  In hindsight that seems glaringly obvious ... I wonder how we
missed that when we built that infrastructure for REINDEX TABLE?

I'm pretty sure that infrastructure is my fault, so I'll take a
whack at fixing this.

Did you figure out why this doesn't also happen in HEAD?

regards, tom lane




Re: using index or check in ALTER TABLE SET NOT NULL

2019-05-01 Thread Andres Freund
On 2019-03-25 11:09:47 -0400, Robert Haas wrote:
> On Mon, Mar 25, 2019 at 3:51 AM David Steele  wrote:
> > On 3/17/19 3:40 PM, David Rowley wrote:
> > > On Thu, 14 Mar 2019 at 06:24, Robert Haas  wrote:
> > >
> > > I think we can mark this patch as committed now as I don't think the
> > > lack of a way to test it is likely to cause it to be reverted.
> >
> > Should we mark this as committed now or is there more tweaking to be done?
> 
> I have now marked it as Committed.

There's still an open item "Debate INFO messages in ATTACH PARTITION and
SET NOT NULL" for this thread.  Is there anything further to be done? Or
are we content with this for v12?

- Andres




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-01 Thread Bruce Momjian
On Wed, May  1, 2019 at 09:08:54AM -0700, Andres Freund wrote:
> So sure, there's a few typo fixes, one bugfix, and one buildfarm test
> stability issue. Doesn't seem crazy for a nontrivial improvement.

OK, my ignorant opinion was just based on the length of discussion
threads.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: compiler warning in pgcrypto imath.c

2019-05-01 Thread Andres Freund
Hi Noah,

On 2019-03-23 00:02:36 -0700, Noah Misch wrote:
> On Sat, Mar 23, 2019 at 10:20:16AM +0900, Michael Paquier wrote:
> > On Fri, Mar 22, 2019 at 08:20:53PM -0400, Jeff Janes wrote:
> > > PostgreSQL 12devel on aarch64-unknown-linux-gnu, compiled by gcc
> > > (Ubuntu/Linaro 7.3.0-27ubuntu1~18.04) 7.3.0, 64-bit
> > 
> > Adding Noah in CC as he has done the update of imath lately.
> > 
> > > The attached patch adds PG_USED_FOR_ASSERTS_ONLY to silence it.  Perhaps
> > > there is a better way, given that we want to change imath.c as little as
> > > possible from its upstream?
> > 
> > Maybe others have better ideas, but marking the variable with
> > PG_USED_FOR_ASSERTS_ONLY as you propose seems like the least invasive
> > method of all.
> 
> That patch looks good.  Thanks.  The main alternative would be to pass
> -Wno-unused for this file.  Since you're proposing only one instance
> PG_USED_FOR_ASSERTS_ONLY, I favor PG_USED_FOR_ASSERTS_ONLY over -Wno-unused.

This is marked as an open item, owned by you. Could you commit the
patch or otherwise resolve the issue?

Greetings,

Andres Freund




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-05-01 Thread Andres Freund
Hi,

On 2019-04-01 18:26:59 -0700, Andres Freund wrote:
> I'm not yet convinced it's necessary to create a new GUC, but also not
> strongly opposed.  I've created an open items issue for it, so we don't
> forget.

My current inclination is to not do anything for v12. Robert, do you
disagree?

Greetings,

Andres Freund




Re: New vacuum option to do only freezing

2019-05-01 Thread Andres Freund
Hi,

On 2019-04-06 16:13:53 +0900, Michael Paquier wrote:
> On Sat, Apr 06, 2019 at 11:31:31AM +0900, Masahiko Sawada wrote:
> > Yes, but Fujii-san pointed out that this option doesn't support toast
> > tables, and I think there is no specific reason not to support
> > them. So it might be good to add toast.vacuum_index_cleanup. Attached
> > patch.
>
> Being able to control that option at toast level sounds sensible.  I
> have added an open item about that.

Robert, what is your stance on this open item? It's been an open item
for about three weeks, without any progress.

Greetings,

Andres Freund




Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-05-01 Thread Andres Freund
Hi,

On 2019-04-08 19:22:04 +0900, Fujii Masao wrote:
> On Mon, Apr 8, 2019 at 5:30 PM Masahiko Sawada  wrote:
> > "TRUNCATE" option for vacuum command should be added to the open items?
> 
> Yes, I think.
> Attached is the patch which adds TRUNCATE option into VACUUM.

This has been an open item for about three weeks. Please work on
resolving this soon.

- Andres




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-01 Thread Andres Freund
Hi,

On 2019-05-01 11:28:11 -0400, Bruce Momjian wrote:
> On Wed, May  1, 2019 at 08:24:25AM -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2019-04-18 14:10:29 -0700, Andres Freund wrote:
> > > My compromise suggestion would be to try to give John and Amit ~2 weeks
> > > to come up with a cleanup proposal, and then decide whether to 1) revert
> > > 2) apply the new patch, 3) decide to live with the warts for 12, and
> > > apply the patch in 13. As we would already have a patch, 3) seems like
> > > it'd be more tenable than without.
> >
> > I think decision time has come. My tentative impression is that we're
> > not there yet, and should revert the improvements in v12, and apply the
> > improved version early in v13. As a second choice, we should live with
> > the current approach, if John and Amit "promise" further effort to clean
> > this up for v13.
>
> My ignorant opinion is that I have been surprised by the churn caused by
> this change, and have therefore questioned the value of it.

Hm, I don't think there has been that much churn? Sure, there was a
revert to figure out a regression test instability, but that doesn't
seem that bad. Relevant commits in date order are:


andres-classification: cleanup
commit 06c8a5090ed9ec188557a86d4de11384f5128ec0
Author: Amit Kapila 
Date:   2019-03-16 06:55:56 +0530

Improve code comments in b0eaa4c51b.

Author: John Naylor
Discussion: 
https://postgr.es/m/CACPNZCswjyGJxTT=mxHgK=Z=mJ9uJ4WEx_UO=bnwpr_i0ea...@mail.gmail.com


andres-classification: incremental improvement
commit 13e8643bfc29d3c1455c0946281cdfc24758ffb6
Author: Amit Kapila 
Date:   2019-03-15 08:25:57 +0530

During pg_upgrade, conditionally skip transfer of FSMs.


andres-classification: additional tests
commit 6f918159a97acf76ee2512e44f5ed5dcaaa0d923
Author: Amit Kapila 
Date:   2019-03-12 08:14:28 +0530

Add more tests for FSM.


andres-classification: cleanup
commit a6e48da08844eeb5a72c8b59dad3aaab6e891fac
Author: Amit Kapila 
Date:   2019-03-11 08:16:14 +0530

Fix typos in commit 8586bf7ed8.


andres-classification: bugfix
commit 9c32e4c35026bd52aaf340bfe7594abc653e42f0
Author: Amit Kapila 
Date:   2019-03-01 07:38:47 +0530

Clear the local map when not used.


andres-classification: docs addition
commit 29d108cdecbe918452e70041d802cc515b2d56b8
Author: Amit Kapila 
Date:   2019-02-20 17:37:39 +0530

Doc: Update the documentation for FSM behavior for small tables.


andres-classification: regression test stability
commit 08ecdfe7e5e0a31efbe1d58fefbe085b53bc79ca
Author: Amit Kapila 
Date:   2019-02-04 10:08:29 +0530

Make FSM test portable.


andres-classification: feature
commit b0eaa4c51bbff3e3c600b11e5d104d6feb9ca77f
Author: Amit Kapila 
Date:   2019-02-04 07:49:15 +0530

Avoid creation of the free space map for small heap relations, take 2.


So sure, there's a few typo fixes, one bugfix, and one buildfarm test
stability issue. Doesn't seem crazy for a nontrivial improvement.


> Frankly, there has been so much churn I am unclear if it can be easily 
> reverted.

Doesn't seem that hard? There's some minor conflicts, but nothing bad?

Greetings,

Andres Freund




Re: hyrax vs. RelationBuildPartitionDesc

2019-05-01 Thread Andres Freund
Hi

The message I'm replying to is marked as an open item. Robert, what do
you think needs to be done here before release?  Could you summarize,
so we then can see what others think?

Greetings,

Andres Freund




Re: walsender vs. XLogBackgroundFlush during shutdown

2019-05-01 Thread Andres Freund
Hi,

On 2019-05-01 02:28:45 +0200, Tomas Vondra wrote:
> The reason for that is pretty simple - the walsenders are doing logical
> decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
> But per XLogSendLogical() this only happens if
> 
>if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
>{
>WalSndCaughtUp = true;
>...
>}
> 
> That is, we need to get beyond GetFlushRecPtr(). But that may never
> happen, because there may be an incomplete (and unflushed) page in WAL
> buffers, not forced out by any transaction. So if there's a WAL record
> overflowing to the unflushed page, the walsender will never catch up.
> 
> Now, this situation is apparently expected, because WalSndWaitForWal()
> does this:
> 
>/*
> * If we're shutting down, trigger pending WAL to be written out,
> * otherwise we'd possibly end up waiting for WAL that never gets
> * written, because walwriter has shut down already.
> */
>if (got_STOPPING)
>XLogBackgroundFlush();
> 
> but unfortunately that does not actually do anything, because right at
> the very beginning XLogBackgroundFlush() does this:
> 
>/* back off to last completed page boundary */
>WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;

> That is, it intentionally ignores the incomplete page, which means the
> walsender can't read the record and reach GetFlushRecPtr().
> 
> XLogBackgroundFlush() has done this since (at least) 2007, so how come we
> never had issues with this before?

I assume that's because of the following logic:
/* if we have already flushed that far, consider async commit records */
if (WriteRqst.Write <= LogwrtResult.Flush)
{
SpinLockAcquire(&XLogCtl->info_lck);
WriteRqst.Write = XLogCtl->asyncXactLSN;
SpinLockRelease(&XLogCtl->info_lck);
flexible = false;   /* ensure it all gets written */
}

and various pieces of the code doing XLogSetAsyncXactLSN() to force
flushing.  I wonder if the issue is that we're better at avoiding
unnecessary WAL to be written due to
6ef2eba3f57f17960b7cd4958e18aa79e357de2f


> But I don't think we're safe without the failover slots patch, because
> any output plugin can do something similar - say, LogLogicalMessage() or
> something like that. I'm not aware of a plugin doing such things, but I
> don't think it's illegal / prohibited either. (Of course, plugins that
> generate WAL won't be useful for decoding on standby in the future.)

FWIW, I'd consider such an output plugin outright broken.


> So what I think we should do is to tweak XLogBackgroundFlush() so that
> during shutdown it skips the back-off to page boundary, and flushes even
> the last piece of WAL. There are only two callers anyway, so something
> like XLogBackgroundFlush(bool) would be simple enough.

I think it then just ought to be a normal XLogFlush(). I.e. something
along the lines of:

	/*
	 * If we're shutting down, trigger pending WAL to be written out,
	 * otherwise we'd possibly end up waiting for WAL that never gets
	 * written, because walwriter has shut down already.
	 */
	if (got_STOPPING && !RecoveryInProgress())
		XLogFlush(GetXLogInsertRecPtr());

	/* Update our idea of the currently flushed position. */
	if (!RecoveryInProgress())
		RecentFlushPtr = GetFlushRecPtr();
	else
		RecentFlushPtr = GetXLogReplayRecPtr(NULL);


Greetings,

Andres Freund




Re: New vacuum option to do only freezing

2019-05-01 Thread Andres Freund
Hi,

This thread is referenced by an open item, and we ought to make some
progress on it.

On a more cosmetic note:

On 2019-04-16 09:10:19 -0700, Andres Freund wrote:
> On 2019-04-16 12:01:36 -0400, Tom Lane wrote:
> > (BTW, I don't understand why that code will throw "found xmin %u from
> > before relfrozenxid %u" if HeapTupleHeaderXminFrozen is true?  Shouldn't
> > the whole if-branch at lines 6113ff be skipped if xmin_frozen?)
>
> I *think* that just looks odd, but isn't actively wrong. That's because
> TransactionIdIsNormal() won't trigger, as:
>
> #define HeapTupleHeaderGetXmin(tup) \
> ( \
>   HeapTupleHeaderXminFrozen(tup) ? \
>   FrozenTransactionId : HeapTupleHeaderGetRawXmin(tup) \
> )
>
> which afaict makes
>   xmin_frozen = ((xid == FrozenTransactionId) ||
>  HeapTupleHeaderXminFrozen(tuple));
> redundant.
>
> Looks like that was introduced relatively recently, in
>
> commit d2599ecfcc74fea9fad1720a70210a740c716730
> Author: Alvaro Herrera 
> Date:   2018-05-04 15:24:44 -0300
>
> Don't mark pages all-visible spuriously
>
>
> @@ -6814,6 +6815,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>
> /* Process xmin */
> xid = HeapTupleHeaderGetXmin(tuple);
> +   xmin_frozen = ((xid == FrozenTransactionId) ||
> +  HeapTupleHeaderXminFrozen(tuple));
> if (TransactionIdIsNormal(xid))
> {
> if (TransactionIdPrecedes(xid, relfrozenxid))
> @@ -6832,9 +6835,8 @@ heap_prepare_freeze_tuple(HeapTupleHeader tuple,
>
> frz->t_infomask |= HEAP_XMIN_FROZEN;
> changed = true;
> +   xmin_frozen = true;
> }
> -   else
> -   totally_frozen = false;
> }

Alvaro, could we perhaps clean this up a bit? This is pretty confusing
looking.  I think this probably could just be changed to

boolxmin_frozen = false;

xid = HeapTupleHeaderGetXmin(tuple);

if (xid == FrozenTransactionId)
xmin_frozen = true;
else if (TransactionIdIsNormal(xid))

or somesuch.  There's no need to check for
HeapTupleHeaderXminFrozen(tuple) etc, because HeapTupleHeaderGetXmin()
already does so - and if it didn't, the issue Tom points out would be
problematic.

Greetings,

Andres Freund




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-01 Thread Bruce Momjian
On Wed, May  1, 2019 at 08:24:25AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-04-18 14:10:29 -0700, Andres Freund wrote:
> > My compromise suggestion would be to try to give John and Amit ~2 weeks
> > to come up with a cleanup proposal, and then decide whether to 1) revert
> > 2) apply the new patch, 3) decide to live with the warts for 12, and
> > apply the patch in 13. As we would already have a patch, 3) seems like
> > it'd be more tenable than without.
> 
> I think decision time has come. My tentative impression is that we're
> not there yet, and should revert the improvements in v12, and apply the
> improved version early in v13. As a second choice, we should live with
> the current approach, if John and Amit "promise" further effort to clean
> this up for v13.

My ignorant opinion is that I have been surprised by the churn caused by
this change, and have therefore questioned the value of it.  Frankly,
there has been so much churn I am unclear if it can be easily reverted.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-05-01 Thread Andres Freund
Hi,

On 2019-04-18 14:10:29 -0700, Andres Freund wrote:
> My compromise suggestion would be to try to give John and Amit ~2 weeks
> to come up with a cleanup proposal, and then decide whether to 1) revert
> 2) apply the new patch, 3) decide to live with the warts for 12, and
> apply the patch in 13. As we would already have a patch, 3) seems like
> it'd be more tenable than without.

I think decision time has come. My tentative impression is that we're
not there yet, and should revert the improvements in v12, and apply the
improved version early in v13. As a second choice, we should live with
the current approach, if John and Amit "promise" further effort to clean
this up for v13.

Greetings,

Andres Freund




Re: to_timestamp docs

2019-05-01 Thread Bruce Momjian
On Wed, May  1, 2019 at 10:01:50AM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > I don't think the changes made in PG 12 are documented accurately.
> 
> That code is swapped out of my head at the moment, but it looks
> to me like the para before the one you changed is where we discuss
> the behavior for whitespace.  I'm not sure that this change is
> right, or an improvement, in the context of both paras.

Thanks.  I think I see the sentence you are thinking of:

   to_timestamp and to_date
   skip multiple blank spaces at the beginning of the input string
   and around date and time values unless the FX
   option is used.

However, first, it is unclear what 'skip' means here, i.e., does it mean
that multiple blank spaces become a single space, or that they are ignored?  That
should be clarified, though I am unclear if that matters based on how
separators are handled.  Also, I think "blank spaces" should be
"whitespace".

Second, I see inconsistent behaviour around the use of FX for various
patterns, e.g.:

SELECT to_timestamp('5   1976','FXDD_FX');
  to_timestamp
----------------------
 1976-01-05 00:00:00-05

SELECT to_timestamp('JUL JUL','FXMON_FXMON');
  to_timestamp
-------------------------------
 0001-07-01 00:00:00-04:56:02 BC

SELECT to_timestamp('JULJUL','FXMON_FXMON');
ERROR:  invalid value "   " for "MON"
DETAIL:  The given value did not match any of the allowed values for 
this field.

It seems DD and  (as numerics?) in FX mode eat trailing whitespace,
while MON does not?  Also, I used these queries to determine it is
"trailing" whitespace that "FXMON" controls:

SELECT to_timestamp('JUL   JUL JUL','MON_FXMON_MON');
  to_timestamp
-------------------------------
 0001-07-01 00:00:00-04:56:02 BC

SELECT to_timestamp('JUL JUL   JUL','MON_FXMON_MON');
ERROR:  invalid value "  J" for "MON"
DETAIL:  The given value did not match any of the allowed values for 
this field.

Once we figure out how it is behaving I think we can pull together the
FX text above to reference the separator text below.
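For what it's worth, the non-FX behaviour shown above seems consistent with a simple rule: a template separator consumes a whole run of whitespace, or exactly one non-alphanumeric character, or nothing at all. A rough Python model of that rule (illustrative only, not the formatting.c implementation):

```python
def consume_separator(s, i):
    """Model of how a non-FX template separator advances through the
    input string s starting at offset i: it eats a whole run of
    whitespace, or a single non-alphanumeric character, or nothing."""
    if i < len(s) and s[i].isspace():
        # a run of whitespace is consumed entirely
        while i < len(s) and s[i].isspace():
            i += 1
        return i
    if i < len(s) and not s[i].isalnum():
        # exactly one separator character is consumed
        return i + 1
    # the separator may also match nothing
    return i

# consume_separator('%1976', 0)      -> 1 (single '%' eaten)
# consume_separator('%%1976', 0)     -> 1 (second '%' left behind, hence the error)
# consume_separator('  1976', 0)     -> 2 (whole whitespace run eaten)
# consume_separator('1976', 0)       -> 0 (separator matched nothing)
```

This reproduces all four non-FX examples from my earlier message; the FX cases above clearly do not follow it, which is part of the inconsistency.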

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: walsender vs. XLogBackgroundFlush during shutdown

2019-05-01 Thread Tomas Vondra

On Wed, May 01, 2019 at 02:13:10PM +0200, Alexander Kukushkin wrote:

Hi Tomas,

On Wed, 1 May 2019 at 02:28, Tomas Vondra  wrote:


I see there's an ongoing discussion about race conditions in walreceiver
blocking shutdown, so let me start a new thread about a race condition in
walsender during shutdown.

The symptoms are rather simple - 'pg_ctl -m fast shutdown' gets stuck,
waiting for walsender processes to catch-up and terminate indefinitely.


I can confirm; during the past couple of years we have observed such a
problem a few times, and it is really annoying.


The reason for that is pretty simple - the walsenders are doing logical
decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
But per XLogSendLogical() this only happens if

if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
{
WalSndCaughtUp = true;
...
}


After a couple of days investigating and debugging I came to a
slightly different conclusion: WalSndCaughtUp is set to true in my
case.
Since I am mostly a python person, I decided to use psycopg2 for my
investigation. I took an example from
http://initd.org/psycopg/docs/advanced.html#logical-replication-quick-start
as a starting point, created a slot and started the script.
I wasn't running any transactions, therefore the DemoConsumer.__call__
was never executed and cursor.send_feedback(flush_lsn=msg.data_start)
was never called either. Basically, all the psycopg2
internals were doing was periodically sending keepalive messages or
replying to keepalives sent by postgres.


OK, so that seems like a separate issue, somewhat unrelated to the issue I
reported. And I'm not sure it's a walsender issue - it seems it might be a
psycopg2 issue in not reporting the flush properly, no?


Actually, the same problem will happen even when the consumer script
receives some messages, just not very intensively; it is simply a bit
harder to reproduce.

If you attach to the walsender with gdb, you'll see the following picture:
(gdb) bt
#0  0x7fd6623d296a in __libc_send (fd=8, buf=0x55cb958dca08,
len=94, flags=0) at ../sysdeps/unix/sysv/linux/send.c:28
#1  0x55cb93aa7ce9 in secure_raw_write (port=0x55cb958d71e0,
ptr=0x55cb958dca08, len=94) at be-secure.c:318
#2  0x55cb93aa7b87 in secure_write (port=0x55cb958d71e0,
ptr=0x55cb958dca08, len=94) at be-secure.c:265
#3  0x55cb93ab6bf9 in internal_flush () at pqcomm.c:1433
#4  0x55cb93ab6b89 in socket_flush () at pqcomm.c:1409
#5  0x55cb93dac30b in send_message_to_frontend
(edata=0x55cb942b4380 ) at elog.c:3317
#6  0x55cb93da8973 in EmitErrorReport () at elog.c:1481
#7  0x55cb93da5abf in errfinish (dummy=0) at elog.c:481
#8  0x55cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3
"sending replication keepalive") at elog.c:1376
#9  0x55cb93bcae71 in WalSndKeepalive (requestReply=true) at
walsender.c:3358


Is it stuck in the send() call forever, or did you happen to grab
this backtrace?



All of the above probably looks like a brain dump, but I don't think
it conflicts with Tomas's findings; rather, it complements them.
I am very glad that now I know how to mitigate the problem on the
client side, but IMHO it is also very important to fix the server
behavior if it is ever possible.



I think having a report of an issue, with a way to reproduce it, is a first
(and quite important) step. So thanks for doing that.

That being said, I think those are two separate issues, with different
causes and likely different fixes. I don't think fixing the xlog flush
will resolve your issue, and vice versa.

FWIW attached is a patch that I used to reliably trigger the xlog flush
issue - it simply adds LogLogicalMessage() to the commit handler in the
built-in output plugin. So all you need to do is create a subscription,
start generating commits, and trigger a restart. The message is 8kB, so it's
definitely long enough to overflow to the next xlog page.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/replication/pgoutput/pgoutput.c 
b/src/backend/replication/pgoutput/pgoutput.c
index bf64c8e4a4..a18864666e 100644
--- a/src/backend/replication/pgoutput/pgoutput.c
+++ b/src/backend/replication/pgoutput/pgoutput.c
@@ -16,6 +16,7 @@
 
 #include "replication/logical.h"
 #include "replication/logicalproto.h"
+#include "replication/message.h"
 #include "replication/origin.h"
 #include "replication/pgoutput.h"
 
@@ -247,11 +248,19 @@ static void
 pgoutput_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
XLogRecPtr commit_lsn)
 {
+   charmessage[8192];
+
OutputPluginUpdateProgress(ctx);
 
OutputPluginPrepareWrite(ctx, true);
logicalrep_write_commit(ctx->out, txn, commit_lsn);
OutputPluginWrite(ctx, true);
+
+   /* a simple string */
+   memset(message, 'a', 8

Re: to_timestamp docs

2019-05-01 Thread Tom Lane
Bruce Momjian  writes:
> I don't think the changes made in PG 12 are documented accurately.

That code is swapped out of my head at the moment, but it looks
to me like the para before the one you changed is where we discuss
the behavior for whitespace.  I'm not sure that this change is
right, or an improvement, in the context of both paras.

regards, tom lane




to_timestamp docs

2019-05-01 Thread Bruce Momjian
I don't think the changes made in PG 12 are documented accurately.  It
currently says:

to_timestamp and to_date matches any single separator in the input
string or is skipped

However, I think it is more accurate to say _multiple_ whitespace can
also be matched by a single separator:

SELECT to_timestamp('%1976','_');
  to_timestamp
----------------------
 1976-01-01 00:00:00-05

SELECT to_timestamp('%%1976','_');
ERROR:  invalid value "%197" for ""
DETAIL:  Value must be an integer.

-- two spaces
--> SELECT to_timestamp('  1976','_');
  to_timestamp
----------------------
 1976-01-01 00:00:00-05

--> SELECT to_timestamp(E'\t\t\t1976','_');
  to_timestamp
----------------------
 1976-01-01 00:00:00-05

Proposed patch attached.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
new file mode 100644
index d751766..b27f4d4
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*** SELECT regexp_match('abc01234xyz', '(?:(
*** 6390,6396 

 A separator (a space or non-letter/non-digit character) in the template string of
 to_timestamp and to_date
!matches any single separator in the input string or is skipped,
 unless the FX option is used.
 For example, to_timestamp('2000JUN', '///MON') and
 to_timestamp('2000/JUN', ' MON') work, but
--- 6390,6396 

 A separator (a space or non-letter/non-digit character) in the template string of
 to_timestamp and to_date
!matches multiple whitespace characters in the input string, a single separator, or nothing,
 unless the FX option is used.
 For example, to_timestamp('2000JUN', '///MON') and
 to_timestamp('2000/JUN', ' MON') work, but


Re: walsender vs. XLogBackgroundFlush during shutdown

2019-05-01 Thread Alexander Kukushkin
 Hi Tomas,

On Wed, 1 May 2019 at 02:28, Tomas Vondra  wrote:

> I see there's an ongoing discussion about race conditions in walreceiver
> blocking shutdown, so let me start a new thread about a race condition in
> walsender during shutdown.
>
> The symptoms are rather simple - 'pg_ctl -m fast shutdown' gets stuck,
> waiting for walsender processes to catch-up and terminate indefinitely.

I can confirm; during the past couple of years we have observed such a
problem a few times, and it is really annoying.

> The reason for that is pretty simple - the walsenders are doing logical
> decoding, and during shutdown they're waiting for WalSndCaughtUp=true.
> But per XLogSendLogical() this only happens if
>
> if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
> {
> WalSndCaughtUp = true;
> ...
> }

After a couple of days investigating and debugging I came to a
slightly different conclusion: WalSndCaughtUp is set to true in my
case.
Since I am mostly a python person, I decided to use psycopg2 for my
investigation. I took an example from
http://initd.org/psycopg/docs/advanced.html#logical-replication-quick-start
as a starting point, created a slot and started the script.
I wasn't running any transactions, therefore the DemoConsumer.__call__
was never executed and cursor.send_feedback(flush_lsn=msg.data_start)
was never called either. Basically, all the psycopg2
internals were doing was periodically sending keepalive messages or
replying to keepalives sent by postgres. In the postgres debug log
they are visible as:

2019-05-01 12:58:32.785 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0

If you try to do a fast shutdown of postgres while the script is
running, it will never finish, and in the postgres log you will get an
indefinite stream of messages:
2019-05-01 13:00:02.880 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.880 CEST [13939] DEBUG:  sending replication keepalive
2019-05-01 13:00:02.880 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.880 CEST [13939] DEBUG:  sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  sending replication keepalive
2019-05-01 13:00:02.881 CEST [13939] DEBUG:  write 0/0 flush 0/0 apply 0/0

Actually, the same problem will happen even when the consumer script
receives some messages, just not very intensively; it is simply a bit
harder to reproduce.
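For reference, the client-side mitigation is simply to keep reporting a flush position from the consumer, so the server sees the client confirming LSNs and shutdown can complete. A sketch using psycopg2's replication support (the slot name, DSN, and keepalive interval are illustrative, and real code would actually process msg.payload before confirming it):

```python
def advance_flush(current_flush_lsn, msg_lsn):
    """Pure helper: the flush position we report never moves backwards."""
    return max(current_flush_lsn, msg_lsn)

def run_consumer(dsn, slot_name):
    # imported here so advance_flush() stays usable without psycopg2 installed
    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect(
        dsn, connection_factory=psycopg2.extras.LogicalReplicationConnection)
    cur = conn.cursor()
    cur.start_replication(slot_name=slot_name, decode=True)

    state = {"flush": 0}

    def consumer(msg):
        # process msg.payload here, then confirm the position
        state["flush"] = advance_flush(state["flush"], msg.data_start)
        msg.cursor.send_feedback(flush_lsn=state["flush"])

    # consume_stream answers server keepalives internally; the explicit
    # send_feedback() above keeps the reported flush LSN advancing
    cur.consume_stream(consumer, keepalive_interval=10)
```

The advance_flush() helper only makes the invariant explicit; the substantive point is that send_feedback() is called with a flush_lsn at all, instead of the client silently acknowledging nothing.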

If you attach to the walsender with gdb, you'll see the following picture:
(gdb) bt
#0  0x7fd6623d296a in __libc_send (fd=8, buf=0x55cb958dca08,
len=94, flags=0) at ../sysdeps/unix/sysv/linux/send.c:28
#1  0x55cb93aa7ce9 in secure_raw_write (port=0x55cb958d71e0,
ptr=0x55cb958dca08, len=94) at be-secure.c:318
#2  0x55cb93aa7b87 in secure_write (port=0x55cb958d71e0,
ptr=0x55cb958dca08, len=94) at be-secure.c:265
#3  0x55cb93ab6bf9 in internal_flush () at pqcomm.c:1433
#4  0x55cb93ab6b89 in socket_flush () at pqcomm.c:1409
#5  0x55cb93dac30b in send_message_to_frontend
(edata=0x55cb942b4380 ) at elog.c:3317
#6  0x55cb93da8973 in EmitErrorReport () at elog.c:1481
#7  0x55cb93da5abf in errfinish (dummy=0) at elog.c:481
#8  0x55cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3
"sending replication keepalive") at elog.c:1376
#9  0x55cb93bcae71 in WalSndKeepalive (requestReply=true) at
walsender.c:3358
#10 0x55cb93bca062 in WalSndDone (send_data=0x55cb93bc9e29
) at walsender.c:2872
#11 0x55cb93bc9155 in WalSndLoop (send_data=0x55cb93bc9e29
) at walsender.c:2194
#12 0x55cb93bc7b11 in StartLogicalReplication (cmd=0x55cb95931cc0)
at walsender.c:1109
#13 0x55cb93bc83d6 in exec_replication_command
(cmd_string=0x55cb958b2360 "START_REPLICATION SLOT \"test\" LOGICAL
0/") at walsender.c:1541
#14 0x55cb93c31653 in PostgresMain (argc=1, argv=0x55cb958deb68,
dbname=0x55cb958deb48 "postgres", username=0x55cb958deb28
"akukushkin") at postgres.c:4178
#15 0x55cb93b95185 in BackendRun (port=0x55cb958d71e0) at postmaster.c:4361
#16 0x55cb93b94824 in BackendStartup (port=0x55cb958d71e0) at
postmaster.c:4033
#17 0x55cb93b90ccd in ServerLoop () at postmaster.c:1706
#18 0x55cb93b90463 in PostmasterMain (argc=3, argv=0x55cb958ac710)
at postmaster.c:1379
#19 0x55cb93abb08e in main (argc=3, argv=0x55cb958ac710) at main.c:228
(gdb) f 10
#10 0x55cb93bca062 in WalSndDone (send_data=0x55cb93bc9e29
) at walsender.c:2872
2872WalSndKeepalive(true);
(gdb) p WalSndCaughtUp
$1 = true
(gdb) p *MyWalSnd
$2 = {pid = 21845, state = WALSNDSTATE_STREAMING, sentPtr = 23586

Re: [PATCH v4] Add \warn to psql

2019-05-01 Thread Fabien COELHO




I guess that you have a verbose ~/.psqlrc.

Can you try with adding -X to psql option when calling psql from the tap
test?


Ah, true. This patch works for me:

diff --git a/src/bin/psql/t/001_psql.pl b/src/bin/psql/t/001_psql.pl
index 32dd43279b..637baa94c9 100644
--- a/src/bin/psql/t/001_psql.pl
+++ b/src/bin/psql/t/001_psql.pl
@@ -20,7 +20,7 @@ sub psql
  {
 local $Test::Builder::Level = $Test::Builder::Level + 1;
 my ($opts, $stat, $in, $out, $err, $name) = @_;
-   my @cmd = ('psql', split /\s+/, $opts);
+   my @cmd = ('psql', '-X', split /\s+/, $opts);
 $node->command_checks_all(\@cmd, $stat, $out, $err, $name, $in);
 return;
  }


Please find attached :)


Good. Works for me, even with a verbose .psqlrc. Switched back to ready.

--
Fabien.




Re: [PATCH v4] Add \warn to psql

2019-05-01 Thread David Fetter
On Wed, May 01, 2019 at 10:05:44AM +0300, Arthur Zakirov wrote:
> (Unfortunately I accidentally sent my previous two messages using my personal
> email address because of my email client configuration. This address is not
> verified by PostgreSQL.org services and messages didn't reach hackers mailing
lists, so I resent the latest message).
> 
> On Tue, Apr 30, 2019 at 4:46 PM Fabien COELHO  wrote:
> > > Unfortunately new TAP test doesn't pass on my machine. I'm not good at 
> > > Perl
> > > and didn't get the reason of the failure quickly.
> >
> > I guess that you have a verbose ~/.psqlrc.
> >
> > Can you try with adding -X to psql option when calling psql from the tap
> > test?
> 
> Ah, true. This patch works for me:
> 
> diff --git a/src/bin/psql/t/001_psql.pl b/src/bin/psql/t/001_psql.pl
> index 32dd43279b..637baa94c9 100644
> --- a/src/bin/psql/t/001_psql.pl
> +++ b/src/bin/psql/t/001_psql.pl
> @@ -20,7 +20,7 @@ sub psql
>   {
>  local $Test::Builder::Level = $Test::Builder::Level + 1;
>  my ($opts, $stat, $in, $out, $err, $name) = @_;
> -   my @cmd = ('psql', split /\s+/, $opts);
> +   my @cmd = ('psql', '-X', split /\s+/, $opts);
>  $node->command_checks_all(\@cmd, $stat, $out, $err, $name, $in);
>  return;
>   }

Please find attached :)

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From 66daae6f73e704ba05dec83b70eebabf9470d768 Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Sun, 21 Apr 2019 11:08:40 -0700
Subject: [PATCH v7] Add \warn to psql
To: hackers

- Does what it says on the label
- Adds TAP tests for same
- In passing, expose the -n option for \echo, \qecho, and \warn in \? output

 create mode 100755 src/bin/psql/t/001_psql.pl



diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index b86764003d..e474efb521 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -3220,6 +3220,25 @@ testdb=> \setenv LESS -imx4F
 
   
 
+  
+\warn text [ ... ]
+
+
+Prints the arguments to the standard error, separated by one
+space and followed by a newline. This can be useful to
+intersperse information in the output of scripts when not polluting
+standard output is desired. For example:
+
+
+=> \warn `date`
+Sun Apr 28 21:00:10 PDT 2019
+
+If the first argument is an unquoted -n the trailing
+newline is not written.
+
+
+  
+
 
   
 \watch [ seconds ]
diff --git a/src/bin/psql/.gitignore b/src/bin/psql/.gitignore
index c2862b12d6..d324c1c1fa 100644
--- a/src/bin/psql/.gitignore
+++ b/src/bin/psql/.gitignore
@@ -3,3 +3,4 @@
 /sql_help.c
 
 /psql
+/tmp_check
diff --git a/src/bin/psql/Makefile b/src/bin/psql/Makefile
index 69bb297fe7..9473ab01cb 100644
--- a/src/bin/psql/Makefile
+++ b/src/bin/psql/Makefile
@@ -60,8 +60,15 @@ uninstall:
 
 clean distclean:
 	rm -f psql$(X) $(OBJS) lex.backup
+	rm -rf tmp_check
 
 # files removed here are supposed to be in the distribution tarball,
 # so do not clean them in the clean/distclean rules
 maintainer-clean: distclean
 	rm -f sql_help.h sql_help.c psqlscanslash.c
+
+check:
+	$(prove_check)
+
+installcheck:
+	$(prove_installcheck)
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 8254d61099..f6fedb45b7 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -319,7 +319,7 @@ exec_command(const char *cmd,
 		status = exec_command_ef_ev(scan_state, active_branch, query_buf, true);
 	else if (strcmp(cmd, "ev") == 0)
 		status = exec_command_ef_ev(scan_state, active_branch, query_buf, false);
-	else if (strcmp(cmd, "echo") == 0 || strcmp(cmd, "qecho") == 0)
+	else if (strcmp(cmd, "echo") == 0 || strcmp(cmd, "qecho") == 0 || strcmp(cmd, "warn") == 0)
 		status = exec_command_echo(scan_state, active_branch, cmd);
 	else if (strcmp(cmd, "elif") == 0)
 		status = exec_command_elif(scan_state, cstack, query_buf);
@@ -1114,7 +1114,7 @@ exec_command_ef_ev(PsqlScanState scan_state, bool active_branch,
 }
 
 /*
- * \echo and \qecho -- echo arguments to stdout or query output
+ * \echo, \qecho, and \warn -- echo arguments to stdout, query output, or stderr
  */
 static backslashResult
 exec_command_echo(PsqlScanState scan_state, bool active_branch, const char *cmd)
@@ -1129,6 +1129,8 @@ exec_command_echo(PsqlScanState scan_state, bool active_branch, const char *cmd)
 
 		if (strcmp(cmd, "qecho") == 0)
 			fout = pset.queryFout;
+		

Re: [PATCH v4] Add \warn to psql

2019-05-01 Thread Arthur Zakirov
(Unfortunately I accidentally sent my previous two messages using my personal
email address because of my email client configuration. This address is not
verified by PostgreSQL.org services and messages didn't reach hackers mailing
lists, so I resent the latest message).

On Tue, Apr 30, 2019 at 4:46 PM Fabien COELHO  wrote:
> > Unfortunately new TAP test doesn't pass on my machine. I'm not good at Perl
> > and didn't get the reason of the failure quickly.
>
> I guess that you have a verbose ~/.psqlrc.
>
> Can you try with adding -X to psql option when calling psql from the tap
> test?

Ah, true. This patch works for me:

diff --git a/src/bin/psql/t/001_psql.pl b/src/bin/psql/t/001_psql.pl
index 32dd43279b..637baa94c9 100644
--- a/src/bin/psql/t/001_psql.pl
+++ b/src/bin/psql/t/001_psql.pl
@@ -20,7 +20,7 @@ sub psql
  {
 local $Test::Builder::Level = $Test::Builder::Level + 1;
 my ($opts, $stat, $in, $out, $err, $name) = @_;
-   my @cmd = ('psql', split /\s+/, $opts);
+   my @cmd = ('psql', '-X', split /\s+/, $opts);
 $node->command_checks_all(\@cmd, $stat, $out, $err, $name, $in);
 return;
  }

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company