Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-28 Thread davinder singh
On Tue, Apr 28, 2020 at 11:45 PM Juan José Santamaría Flecha <
juanjo.santama...@gmail.com> wrote:

>
> On Tue, Apr 28, 2020 at 5:16 PM davinder singh <
> davindersingh2...@gmail.com> wrote:
>
>> I have tested with different locales with codepages including above.
>> There are few which return different locale code but the error messages in
>> both the cases are the same. I have attached the test and log files.
>>
>> But there was one case, where locale code and error messages both are
>> different.
>> Portuguese_Brazil.1252
>>
>> log from [1]
>> 2020-04-28 14:27:39.785 GMT [2284] DEBUG:  IsoLocaleName() executed;
>> locale: "pt"
>> 2020-04-28 14:27:39.787 GMT [2284] ERROR:  division by zero
>> 2020-04-28 14:27:39.787 GMT [2284] STATEMENT:  Select 1/0;
>>
>> log from [2]
>> 2020-04-28 14:36:20.666 GMT [14608] DEBUG:  IsoLocaleName() executed;
>> locale: "pt_BR"
>> 2020-04-28 14:36:20.673 GMT [14608] ERRO:  divisão por zero
>> 2020-04-28 14:36:20.673 GMT [14608] COMANDO:  Select 1/0;
>>
>
> AFAICT, the good result is coming from the new logic.
>
Yes, I also feel the same.

-- 
Regards,
Davinder
EnterpriseDB: http://www.enterprisedb.com


Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-28 Thread davinder singh
On Wed, Apr 29, 2020 at 8:24 AM Amit Kapila  wrote:

> BTW, do you see any different results for pt_PT with create_locale
> version or the new patch version being discussed here?
>
No, there is no difference for pt_PT. The difference you are noticing is
because of the previous locale setting.

-- 
Regards,
Davinder
EnterpriseDB: http://www.enterprisedb.com


Re: Proposing WITH ITERATIVE

2020-04-28 Thread Fabien COELHO


Hello Jonah,

Nice.


-- No ORDER/LIMIT is required with ITERATIVE as only a single tuple is
present
EXPLAIN ANALYZE
WITH ITERATIVE fib_sum (iteration, previous_number, new_number)
 AS (SELECT 1, 0::numeric, 1::numeric
  UNION ALL
 SELECT (iteration + 1), new_number, (previous_number + new_number)
   FROM fib_sum
  WHERE iteration <= 1)
SELECT r.iteration, r.new_number
 FROM fib_sum r;


Nice.

My 0,02€ about the feature design:

I'm wondering how to use such a feature in the context of a WITH query 
with several queries having different behaviors. Currently "WITH" 
introduces a named query (like a view), while "WITH RECURSIVE" introduces 
a mix of recursive and plain named queries; pg actually checks whether each 
one is recursive or not, so "RECURSIVE" is required but could just as well 
be inferred.


Now there could be 3 variants in the mix, and for the feature to be 
orthogonal I think that mixing them should be allowed. However, there is no 
obvious way to distinguish a RECURSIVE query from an ITERATIVE one, as it is 
more a behavioral thing than a structural one. This suggests allowing each 
query to be tagged somehow, e.g. before its name, which would be closer to 
the current approach:


  WITH
foo(i) AS (simple select),
RECURSIVE bla(i) AS (recursive select),
ITERATIVE blup(i) AS (iterative select),

or maybe after AS, which may be clearer because it is closer to the actual 
query, and which looks better to me:


  WITH
foo(i) AS (simple select),
bla(i) AS RECURSIVE (recursive select),
boo(i) AS ITERATIVE (iterative select),
…


Also, with 3 cases I would prefer that the default behavior had a name, so 
that one can refer to it by something other than "the default". Maybe SIMPLE 
would suffice, or something else. ISTM that as nothing is currently expected 
between AS and the opening parenthesis, there is no need for a reserved 
keyword, which is a good thing for the parser.


--
Fabien.

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-04-28 Thread Amit Kapila
On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar  wrote:
>
> On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila  wrote:
> >
> > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar  wrote:
> > >
> > [latest patches]
> >
> > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > - Any actions leading to transaction ID assignment are prohibited.
> > That, among others,
> > + Note that access to user catalog tables or regular system catalog 
> > tables
> > + in the output plugins has to be done via the
> > systable_* scan APIs only.
> > + Access via the heap_* scan APIs will error out.
> > + Additionally, any actions leading to transaction ID assignment
> > are prohibited. That, among others,
> > ..
> > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> >   bool valid;
> >
> >   /*
> > + * We don't expect direct calls to heap_fetch with valid
> > + * CheckXidAlive for regular tables. Track that below.
> > + */
> > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation
> > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > +
> >
> > I think comments and code don't match.  In the comment, we are saying
> > that via output plugins access to user catalog tables or regular
> > system catalog tables won't be allowed via heap_* APIs but code
> > doesn't seem to reflect it.  I feel only
> > TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> > original discussion about this point [1] (Refer "I think it'd also be
> > good to add assertions to codepaths not going through systable_*
> > asserting that ...").
>
> Right,  So I think we can just add an assert in these function that
> Assert(!TransactionIdIsValid(CheckXidAlive)) ?
>

I am fine with an Assertion, but update the documentation accordingly.
However, I think you should first cross-verify whether there are any output
plugins that are already using such APIs.  There is a list of "Logical
Decoding Plugins" on the wiki [1]; just look into those once.

> >
> > Isn't it better to block the scan to user catalog tables or regular
> > system catalog tables for tableam scan APIs rather than at the heap
> > level?  There might be some APIs like heap_getnext where such a check
> > might still be required but I guess it is still better to block at
> > tableam level.
> >
> > [1] - 
> > https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
>
> Okay, let me analyze this part.  Because someplace we have to keep at
> heap level like heap_getnext and other places at tableam level so it
> seems a bit inconsistent.  Also, I think the number of checks might
> going to increase because some of the heap functions like
> heap_hot_search_buffer are being called from multiple tableam calls,
> so we need to put check at every place.
>
> Another point is that I feel some of the checks what we have today
> might not be required like heap_finish_speculative, is not fetching
> any tuple for us so why do we need to care about this function?
>

Yeah, I don't see the need for such a check (or Assertion) in
heap_finish_speculative.

One additional comment:
---
- Any actions leading to transaction ID assignment are prohibited.
That, among others,
+ Note that access to user catalog tables or regular system catalog tables
+ in the output plugins has to be done via the
systable_* scan APIs only.
+ Access via the heap_* scan APIs will error out.
+ Additionally, any actions leading to transaction ID assignment
are prohibited. That, among others,

The above text doesn't seem to be aligned properly, and you need to
update it if we want to change the error to an Assertion for the heap APIs.

[1] - https://wiki.postgresql.org/wiki/Logical_Decoding_Plugins

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-28 Thread Amit Kapila
On Wed, Apr 29, 2020 at 8:21 AM Amit Kapila  wrote:
>
> On Wed, Apr 29, 2020 at 1:32 AM Ranier Vilela  wrote:
> >
> > Em ter., 28 de abr. de 2020 às 16:53, Juan José Santamaría Flecha 
> >  escreveu:
> >>
> >>
> >>
> >> On Tue, Apr 28, 2020 at 9:38 PM Ranier Vilela  wrote:
> >>>
> >>> "pt" means portuguese language.
> >>> "pt_BR" means portuguese language from Brazil, "divisão por zero", is 
> >>> correct.
> >>> "pt_PT" means portuguese language from Portugal, "division by zero"? 
> >>> poderia ser "divisão por zero", too.
> >>>
> >>> Why "pt_PT" do not is translated?
> >>
> >>
> >> The translation files are generated as 'pt_BR.po', so this is the expected 
> >> behaviour.
> >>
> >> With my limited knowledge of Portuguese, it makes little sense to have a 
> >> localized version.
> >
> > Well, both are PORTUGUE language, but, do not the same words.
> > pt_PT.po, obviously is missing, I can provide a version, but still, it 
> > wouldn't be 100%, but it's something.
> > Would it be useful?
> >
>
> I am not sure but that doesn't seem to be related to this patch.  If
> it is not related to this patch then it is better to start a separate
> thread (probably on pgsql-translators list).
>

BTW, do you see any different results for pt_PT with create_locale
version or the new patch version being discussed here?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Pulling up sublink may break join-removal logic

2020-04-28 Thread Andy Fan
On Wed, Apr 29, 2020 at 10:37 AM Richard Guo  wrote:

>
> On Wed, Apr 29, 2020 at 8:23 AM David Rowley  wrote:
>
>> On Tue, 28 Apr 2020 at 19:04, Richard Guo  wrote:
>> > I happened to notice $subject and not sure if it's an issue or not. When
>> > we're trying to remove a LEFT JOIN, one of the requirements is the inner
>> > side needs to be a single baserel. If there is a join qual that is a
>> > sublink and can be converted to a semi join with the inner side rel, the
>> > inner side would no longer be a single baserel and as a result the LEFT
>> > JOIN can no longer be removed.
>>
>> I think, in theory at least, that can be fixed by [1], where we no
>> longer rely on looking to see if the RelOptInfo has a unique index to
>> determine if the relation can duplicate outer side rows during the
>> join. Of course, they'll only exist on base relations, so hence the
>> check you're talking about. Instead, the patch's idea is to propagate
>> uniqueness down the join tree in the form of UniqueKeys.
>>
>
> Do you mean we're tracking the uniqueness of each RelOptInfo, baserel or
> joinrel, with UniqueKeys? I like the idea!
>

Yes, it is, and a review of that patch is welcome. :)


>
>> A quick glance shows there are a few implementation details of join
>> removals of why the removal still won't work with [1].  For example,
>> the singleton rel check causes it to abort both on the pre-check and
>> the final join removal check.  There's also the removal itself that
>> assumes we're just removing a single relation. I'd guess that would
>> need to loop over the min_righthand relids with a bms_next_member loop
>> and remove each base rel one by one.  I'd need to look in more detail
>> to know if there are any other limiting factors there.
>>
>
> Yeah, we'll have to teach remove_useless_joins to work with multiple
> relids.
>

You can see [1] for the discussion of this issue with respect to UniqueKey;
search for:
"In the past we have some limited ability to detect the unqiueness after
join, so that's would be ok.  Since we have such ability now,  this may be
another opportunity to improve the join_is_removable function"

I'm checking it today and will have feedback soon.

[1]
https://www.postgresql.org/message-id/CAKU4AWrGrs0Vk5OrZmS1gbTA2ijDH18NHKnXZTPZNuupn%2B%2Bing%40mail.gmail.com


Best Regards
Andy Fan


Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-28 Thread Amit Kapila
On Wed, Apr 29, 2020 at 1:32 AM Ranier Vilela  wrote:
>
> Em ter., 28 de abr. de 2020 às 16:53, Juan José Santamaría Flecha 
>  escreveu:
>>
>>
>>
>> On Tue, Apr 28, 2020 at 9:38 PM Ranier Vilela  wrote:
>>>
>>> "pt" means portuguese language.
>>> "pt_BR" means portuguese language from Brazil, "divisão por zero", is 
>>> correct.
>>> "pt_PT" means portuguese language from Portugal, "division by zero"? 
>>> poderia ser "divisão por zero", too.
>>>
>>> Why "pt_PT" do not is translated?
>>
>>
>> The translation files are generated as 'pt_BR.po', so this is the expected 
>> behaviour.
>>
>> With my limited knowledge of Portuguese, it makes little sense to have a 
>> localized version.
>
> Well, both are PORTUGUE language, but, do not the same words.
> pt_PT.po, obviously is missing, I can provide a version, but still, it 
> wouldn't be 100%, but it's something.
> Would it be useful?
>

I am not sure but that doesn't seem to be related to this patch.  If
it is not related to this patch then it is better to start a separate
thread (probably on pgsql-translators list).

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: Pulling up sublink may break join-removal logic

2020-04-28 Thread Richard Guo
On Wed, Apr 29, 2020 at 8:23 AM David Rowley  wrote:

> On Tue, 28 Apr 2020 at 19:04, Richard Guo  wrote:
> > I happened to notice $subject and not sure if it's an issue or not. When
> > we're trying to remove a LEFT JOIN, one of the requirements is the inner
> > side needs to be a single baserel. If there is a join qual that is a
> > sublink and can be converted to a semi join with the inner side rel, the
> > inner side would no longer be a single baserel and as a result the LEFT
> > JOIN can no longer be removed.
>
> I think, in theory at least, that can be fixed by [1], where we no
> longer rely on looking to see if the RelOptInfo has a unique index to
> determine if the relation can duplicate outer side rows during the
> join. Of course, they'll only exist on base relations, so hence the
> check you're talking about. Instead, the patch's idea is to propagate
> uniqueness down the join tree in the form of UniqueKeys.
>

Do you mean we're tracking the uniqueness of each RelOptInfo, baserel or
joinrel, with UniqueKeys? I like the idea!


>
> A quick glance shows there are a few implementation details of join
> removals of why the removal still won't work with [1].  For example,
> the singleton rel check causes it to abort both on the pre-check and
> the final join removal check.  There's also the removal itself that
> assumes we're just removing a single relation. I'd guess that would
> need to loop over the min_righthand relids with a bms_next_member loop
> and remove each base rel one by one.  I'd need to look in more detail
> to know if there are any other limiting factors there.
>

Yeah, we'll have to teach remove_useless_joins to work with multiple
relids.

Thanks
Richard


Re: [HACKERS] Restricting maximum keep segments by repslots

2020-04-28 Thread Alvaro Herrera
I pushed this one.  Some closing remarks:

On 2020-Apr-28, Alvaro Herrera wrote:

> On 2020-Apr-28, Kyotaro Horiguchi wrote:

> > Agreed to describe what is failed rather than the cause.  However,
> > logical replications slots are always "previously reserved" at
> > creation.
> 
> Bah, of course.  I was thinking in making the equivalent messages all
> identical in all callsites, but maybe they should be different when
> slots are logical.  I'll go over them again.

I changed the ones that can only be logical slots so that they no longer
say "previously reserved WAL".  The one in
pg_replication_slot_advance still uses that wording, because I didn't
think it was worth creating two separate error paths.

> > ERROR:  replication slot "repl" is not usable to get changes
> 
> That wording seems okay, but my specific point for this error message is
> that we were trying to use a physical slot to get logical changes; so
> the fact that the slot has been invalidated is secondary and we should
> complain about the *type* of slot rather than the restart_lsn.

I moved the check for validity to after CreateDecodingContext, so the
other errors are reported preferentially.  I also chose a different wording:

	/*
	 * After the sanity checks in CreateDecodingContext, make sure the
	 * restart_lsn is valid.  Avoid "cannot get changes" wording in this
	 * errmsg because that'd be confusingly ambiguous about no changes
	 * being available.
	 */
	if (XLogRecPtrIsInvalid(MyReplicationSlot->data.restart_lsn))
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("can no longer get changes from replication slot \"%s\"",
						NameStr(*name)),
				 errdetail("This slot has never previously reserved WAL, or has been invalidated.")));

I hope this is sufficiently clear, but if not, feel free to nudge me and
we can discuss it further.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-04-28 Thread Andy Fan
On Wed, Apr 29, 2020 at 8:29 AM David Rowley  wrote:

> On Thu, 16 Apr 2020 at 14:17, Andy Fan  wrote:
> > V6 also includes:
> > 1.  Fix the comment misleading you mentioned above.
> > 2.  Fixed a concern case for `relation_has_uniquekeys_for` function.
>
> Over on [1], Richard highlights a problem in the current join removals
> lack of ability to remove left joins unless the min_righthand side of
> the join is a singleton rel. It's my understanding that the reason the
> code checks for this is down to the fact that join removals used
> unique indexed to prove the uniqueness of the relation and obviously,
> those can only exist on base relations.  I wondered if you might want
> to look into a 0003 patch which removes that restriction? I think this
> can be done now since we no longer look at unique indexes to provide
> the proves that the join to be removed won't duplicate outer side
> rows.
>

Yes, I think that would be another benefit of UniqueKey, but it hasn't
happened until now.  I will take a look at it today and fix it in a
separate commit.

Best Regards
Andy Fan


Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

2020-04-28 Thread David Rowley
On Thu, 16 Apr 2020 at 14:17, Andy Fan  wrote:
> V6 also includes:
> 1.  Fix the comment misleading you mentioned above.
> 2.  Fixed a concern case for `relation_has_uniquekeys_for` function.

Over on [1], Richard highlights a problem with the current join removal
code's inability to remove left joins unless the min_righthand side of
the join is a singleton rel. It's my understanding that the reason the
code checks for this is down to the fact that join removal used
unique indexes to prove the uniqueness of the relation and obviously,
those can only exist on base relations.  I wondered if you might want
to look into a 0003 patch which removes that restriction? I think this
can be done now since we no longer look at unique indexes to provide
the proof that the join to be removed won't duplicate outer side
rows.

David

[1] 
https://www.postgresql.org/message-id/CAMbWs4-THacv3DdMpiTrvg5ZY7sNViFF1pTU=kokmtpbre9...@mail.gmail.com




Re: Pulling up sublink may break join-removal logic

2020-04-28 Thread David Rowley
On Tue, 28 Apr 2020 at 19:04, Richard Guo  wrote:
> I happened to notice $subject and not sure if it's an issue or not. When
> we're trying to remove a LEFT JOIN, one of the requirements is the inner
> side needs to be a single baserel. If there is a join qual that is a
> sublink and can be converted to a semi join with the inner side rel, the
> inner side would no longer be a single baserel and as a result the LEFT
> JOIN can no longer be removed.

I think, in theory at least, that can be fixed by [1], where we no
longer rely on looking to see if the RelOptInfo has a unique index to
determine if the relation can duplicate outer side rows during the
join. Of course, unique indexes only exist on base relations, hence the
check you're talking about. Instead, the patch's idea is to propagate
uniqueness down the join tree in the form of UniqueKeys.

A quick glance shows there are a few implementation details in join
removal because of which the removal still won't work with [1].  For
example, the singleton rel check causes it to abort both on the pre-check
and the final join removal check.  There's also the removal itself, which
assumes we're just removing a single relation. I'd guess that would
need to loop over the min_righthand relids with a bms_next_member loop
and remove each base rel one by one.  I'd need to look in more detail
to know if there are any other limiting factors there.

I'll link back here over on that thread to see if Andy would like to
take a quick look at it.

David

[1] 
https://www.postgresql.org/message-id/caku4awoymsjbm5kdybsko13gufg57px1qyc3y8sdhyrfoqe...@mail.gmail.com




Re: pg_rewind docs correction

2020-04-28 Thread Michael Paquier
On Tue, Apr 28, 2020 at 12:13:38PM -0400, James Coleman wrote:
> I think is missing a word. Instead of "especially the the target"
> should be "especially if the target".

Thanks, fixed.

> In this block:
> 
> +  Create a backup_label file to begin WAL replay at
> +  the checkpoint created at failover and configure the
> +  pg_control file with a minimum consistency LSN
> +  defined as the result of pg_current_wal_insert_lsn()
> +  when rewinding from a live source and using the last checkpoint LSN
> +  when rewinding from a stopped source.
> + 
> 
> Perhaps change "and using the last checkpoint LSN" to "or the last
> checkpoint LSN". Alternatively you could make the grammar parallel by
> changing to "and defined as the last checkpoint LSN", but that seems
> wordy, and the "defined as [item or item]" is already a good grammar
> construction.

Using your first suggestion of "or the last checkpoint LSN" sounds
more natural as of this morning, so I have updated the patch with that.

I am setting that aside for a couple of days to see if others have
more comments, and will likely commit it after an extra look.
--
Michael
From 48b9e85fe9ae414a0f42881558ab2fd920ba0c60 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Wed, 29 Apr 2020 09:11:28 +0900
Subject: [PATCH v6] Improve pg_rewind explanation and warnings

The pg_rewind docs currently assert that the state of the target's
data directory after rewind is equivalent to the source's data
directory. But that isn't quite true both because the base state is
further back in time and because the target's data directory will
include the current state on the source of any copied blocks.
Instead, using the analogy to base backups helps explain the state,
as well as the pros and cons of using the utility.

The How It Works section now:
- Includes details about how the backup_label file is created.
- Includes details about how the pg_control file is updated.
- Is updated to include WAL segments and new relation files in the
  list of files copied wholesale from the source.

Finally, document clearly the state of the cluster after the operation
and also the operation sequencing dangers caused by copying
configuration files from the source.
---
 doc/src/sgml/ref/pg_rewind.sgml | 90 +
 1 file changed, 57 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 07c49e4719..391e5db2e2 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -48,14 +48,16 @@ PostgreSQL documentation
   
 
   
-   The result is equivalent to replacing the target data directory with the
-   source one. Only changed blocks from relation files are copied;
-   all other files are copied in full, including configuration files. The
-   advantage of pg_rewind over taking a new base backup, or
-   tools like rsync, is that pg_rewind does
-   not require reading through unchanged blocks in the cluster. This makes
-   it a lot faster when the database is large and only a small
-   fraction of blocks differ between the clusters.
+   After a successful rewind, the state of the target data directory is
+   analogous to a base backup of the source data directory. Unlike taking
+   a new base backup or using a tool like rsync,
+   pg_rewind does not require comparing or copying
+   unchanged relation blocks in the cluster. Only changed blocks from existing
+   relation files are copied; all other files, including new relation files,
+   configuration files, and WAL segments, are copied in full. As such the
+   rewind operation is significantly faster than other approaches when the
+   database is large and only a small fraction of blocks differ between the
+   clusters.
   
 
   
@@ -77,16 +79,18 @@ PostgreSQL documentation
   
 
   
-   When the target server is started for the first time after running
-   pg_rewind, it will go into recovery mode and replay all
-   WAL generated in the source server after the point of divergence.
-   If some of the WAL was no longer available in the source server when
-   pg_rewind was run, and therefore could not be copied by the
-   pg_rewind session, it must be made available when the
-   target server is started. This can be done by creating a
-   recovery.signal file in the target data directory
-   and configuring suitable 
-   in postgresql.conf.
+   After running pg_rewind, WAL replay needs to
+   complete for the data directory to be in a consistent state. When the
+   target server is started again it will enter archive recovery and replay
+   all WAL generated in the source server from the last checkpoint before
+   the point of divergence. If some of the WAL was no longer available in the
+   source server when pg_rewind was run, and
+   therefore could not be copied by the pg_rewind
+   session, it must be made available when the target server is started.
+   This can be done by creating a recovery.signal file
+   in 

Re: [HACKERS] Restricting maximum keep segments by repslots

2020-04-28 Thread Alvaro Herrera
On 2020-Apr-28, Kyotaro Horiguchi wrote:

> > Anyway I think this patch should fix it also -- instead of adding a new
> > flag, we just rely on the existing flags (since do_checkpoint must have
> > been set correctly from the flags earlier in that block.)
> 
> Since the added (!do_checkpoint) check is reached with
> do_checkpoint=false at server start and at archive_timeout intervals,
> the patch makes checkpointer run a busy-loop at that timings, and that
> loop lasts until a checkpoint is actually executed.
> 
> What we need to do here is not forgetting the fact that the latch has
> been set even if the latch itself gets reset before reaching to
> WaitLatch.

After a few more false starts :-) I think one easy thing we can do
without the additional boolean flag is to call SetLatch there in the
main loop if we see that ckpt_flags is nonzero.

(I had two issues with the boolean flag.  One is that the comment in
ReqCheckpointHandler needed an update to, essentially, say exactly the
opposite of what it was saying; such a change was making me very
uncomfortable.  The other is that the place where the flag was reset in
CheckpointerMain() was ... not really appropriate; or it could have been
appropriate if the flag was called, say, "CheckpointerMainNoSleepOnce".
Because "RequestPending" was the wrong name to use, because if the flag
was for really request pending, then we should reset it inside the "if
do_checkpoint" block .. but as I understand this would cause the
busy-loop behavior you described.)

> The attached patch on 019_replslot_limit.pl does the commands above
> automatically. It sometimes succeed but fails in most cases, at least
> for me.

With the additional SetLatch, the test passes reproducibly for me.
Before the patch, it failed ten out of ten times I ran it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 74751b6a3a049ff83c6bef99e2e39562278a7ba6 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 27 Apr 2020 19:35:15 -0400
Subject: [PATCH] Fix checkpoint signalling

Because the checkpointer process now uses its MyLatch to wait for
walsenders to go away (per commit c6550776394e), we must ensure to nudge
it when going to sleep.
---
 src/backend/postmaster/checkpointer.c |  8 +
 src/test/recovery/t/019_replslot_limit.pl | 44 +--
 2 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e354a78725..94e0161162 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -494,6 +494,14 @@ CheckpointerMain(void)
 		 */
 		pgstat_send_bgwriter();
 
+		/*
+		 * If any checkpoint flags have been set, nudge our latch so that the
+		 * wait below will return immediately, even if a latch signal was
+		 * consumed elsewhere.
+		 */
+		if (((volatile CheckpointerShmemStruct *) CheckpointerShmem)->ckpt_flags)
+			SetLatch(MyLatch);
+
 		/*
 		 * Sleep until we are signaled or it's time for another checkpoint or
 		 * xlog file switch.
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 32dce54522..634f2bec8b 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -8,7 +8,7 @@ use TestLib;
 use PostgresNode;
 
 use File::Path qw(rmtree);
-use Test::More tests => 13;
+use Test::More tests => 14;
 use Time::HiRes qw(usleep);
 
 $ENV{PGDATABASE} = 'postgres';
@@ -179,7 +179,47 @@ for (my $i = 0; $i < 1; $i++)
 }
 ok($failed, 'check that replication has been broken');
 
-$node_standby->stop;
+$node_master->stop('immediate');
+$node_standby->stop('immediate');
+
+my $node_master2 = get_new_node('master2');
+$node_master2->init(allows_streaming => 1);
+$node_master2->append_conf(
+	'postgresql.conf', qq(
+min_wal_size = 32MB
+max_wal_size = 32MB
+log_checkpoints = yes
+));
+$node_master2->start;
+$node_master2->safe_psql('postgres',
+	"SELECT pg_create_physical_replication_slot('rep1')");
+$backup_name = 'my_backup2';
+$node_master2->backup($backup_name);
+
+$node_master2->stop;
+$node_master2->append_conf(
+	'postgresql.conf', qq(
+max_slot_wal_keep_size = 0
+));
+$node_master2->start;
+
+$node_standby = get_new_node('standby_2');
+$node_standby->init_from_backup($node_master2, $backup_name,
+	has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', "primary_slot_name = 'rep1'");
+$node_standby->start;
+my @result =
+  split(
+	'\n',
+	$node_master2->safe_psql(
+		'postgres',
+		"CREATE TABLE tt();
+		 DROP TABLE tt;
+		 SELECT pg_switch_wal();
+		 CHECKPOINT;
+		 SELECT 'finished';",
+		timeout => '60'));
+is($result[1], 'finished', 'check if checkpoint command is not blocked');
 
 #
 # Advance WAL of $node by $n segments
-- 
2.20.1



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-28 Thread Andres Freund
Hi,

On 2020-04-28 18:22:20 -0400, James Coleman wrote:
> I cc'd Andres given his commit introduced simplehash, so I figured
> he'd probably have a few pointers on when each one might be useful.
> [...]
> Do you have any thoughts on what the trade-offs/use-cases etc. are for
> dynahash versus simple hash?
> 
> From reading the commit message in b30d3ea824c it seems like simple
> hash is faster and optimized for CPU cache benefits. The comments at
> the top of simplehash.h also discourage it's use in non
> performance/space sensitive uses, but there isn't anything I can see
> that explicitly tries to discuss when dynahash is useful, etc.

Benefits of dynahash (chained hashtable):
- supports partitioning, useful for shared memory accessed under locks
- better performance for large entries, as they don't need to be moved
  around in case of hash conflicts
- stable pointers to hash table entries

Benefits of simplehash (open addressing hash table):
- no indirect function calls, known structure sizes, due to "templated"
  code generation (these show up substantially in profiles for dynahash);
  see the sketch below
- considerably faster for small entries due to the previous point, and due
  to open addressing hash tables having better cache behaviour than chained
  hashtables
- once set up, the interface is type-safe and easier to use
- no overhead of a separate memory context etc.
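
To make the "templated" point concrete, very roughly, an instantiation looks
like this (illustrative names only; see the comments at the top of
lib/simplehash.h for the exact set of required macros and entry fields):

#include "postgres.h"
#include "common/hashfn.h"		/* murmurhash32(); header location varies by branch */

/* illustrative entry type: maps a uint32 key to a counter */
typedef struct DemoEntry
{
	uint32		key;			/* hash key */
	char		status;			/* entry status, used by simplehash itself */
	uint64		count;			/* payload */
} DemoEntry;

#define SH_PREFIX		demohash
#define SH_ELEMENT_TYPE	DemoEntry
#define SH_KEY_TYPE		uint32
#define SH_KEY			key
#define SH_HASH_KEY(tb, key)	murmurhash32(key)
#define SH_EQUAL(tb, a, b)		((a) == (b))
#define SH_SCOPE		static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

/*
 * The include above generates demohash_create(), demohash_insert(),
 * demohash_lookup() etc., all specialized for DemoEntry at compile time,
 * e.g. (again from memory):
 *
 *	demohash_hash *tab = demohash_create(CurrentMemoryContext, 128, NULL);
 *	bool		found;
 *	DemoEntry  *entry = demohash_insert(tab, some_key, &found);
 */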


> Given the performance notes in that commit message, I thinking
> switching to simple hash is worth it.

Seems plausible to me.


> But I also wonder if there might be some value in a README or comments
> addition that would be a guide to what the various hash
> implementations are useful for. If there's interest, I could try to
> type something short up so that we have something to make the code
> base a bit more discoverable.

That'd make sense to me.

Greetings,

Andres Freund




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-28 Thread Tomas Vondra

On Tue, Apr 28, 2020 at 06:22:20PM -0400, James Coleman wrote:

I cc'd Andres given his commit introduced simplehash, so I figured
he'd probably have a few pointers on when each one might be useful.

On Tue, Apr 28, 2020 at 8:39 AM James Coleman  wrote:
...

> Any particular reasons to pick dynahash over simplehash? ISTM we're
> using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
> while there are not many places using dynahash for simple short-lived
> hash tables. Of course, that alone is a weak reason to insist on using
> simplehash here, but I suppose there were reasons for not using dynahash
> and we'll end up facing the same issues here.

No particular reason; it wasn't clear to me that there was a reason to
prefer one or the other (and I'm not acquainted with the codebase
enough to know the differences), so I chose dynahash because it was
easier to find examples to follow for implementation.


Do you have any thoughts on what the trade-offs/use-cases etc. are for
dynahash versus simple hash?

From reading the commit message in b30d3ea824c it seems like simple
hash is faster and optimized for CPU cache benefits. The comments at
the top of simplehash.h also discourage it's use in non
performance/space sensitive uses, but there isn't anything I can see
that explicitly tries to discuss when dynahash is useful, etc.

Given the performance notes in that commit message, I thinking
switching to simple hash is worth it.



I recall doing some benchmarks for that patch, but it was so long ago that I
don't really remember the details. But in general, I agree simplehash is a bit
more efficient in terms of CPU / caching.

I think the changes required to switch from dynahash to simplehash are
fairly small, so I think the best thing we can do is just try do some
measurement and then decide.


But I also wonder if there might be some value in a README or comments
addition that would be a guide to what the various hash
implementations are useful for. If there's interest, I could try to
type something short up so that we have something to make the code
base a bit more discoverable.



I wouldn't object to that. Although maybe we should simply add some
basic recommendations to the comments in dynahash/simplehash.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-28 Thread James Coleman
I cc'd Andres given his commit introduced simplehash, so I figured
he'd probably have a few pointers on when each one might be useful.

On Tue, Apr 28, 2020 at 8:39 AM James Coleman  wrote:
...
> > Any particular reasons to pick dynahash over simplehash? ISTM we're
> > using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
> > while there are not many places using dynahash for simple short-lived
> > hash tables. Of course, that alone is a weak reason to insist on using
> > simplehash here, but I suppose there were reasons for not using dynahash
> > and we'll end up facing the same issues here.
>
> No particular reason; it wasn't clear to me that there was a reason to
> prefer one or the other (and I'm not acquainted with the codebase
> enough to know the differences), so I chose dynahash because it was
> easier to find examples to follow for implementation.

Do you have any thoughts on what the trade-offs/use-cases etc. are for
dynahash versus simple hash?

From reading the commit message in b30d3ea824c it seems like simplehash
is faster and optimized for CPU cache benefits. The comments at
the top of simplehash.h also discourage its use in non
performance/space-sensitive cases, but there isn't anything I can see
that explicitly discusses when dynahash is useful, etc.

Given the performance notes in that commit message, I think
switching to simplehash is worth it.

But I also wonder if there might be some value in a README or comments
addition that would be a guide to what the various hash
implementations are useful for. If there's interest, I could try to
type something short up so that we have something to make the code
base a bit more discoverable.

James




Re: More efficient RI checks - take 2

2020-04-28 Thread Andres Freund
Hi,

On 2020-04-28 10:44:58 -0400, Tom Lane wrote:
> Stephen Frost  writes:
> > * Robert Haas (robertmh...@gmail.com) wrote:
> >> As you say, perhaps there's room for both things, but also as you say,
> >> it's not obvious how to decide intelligently between them.
>
> > The single-row case seems pretty clear and also seems common enough that
> > it'd be worth paying the cost to figure out if it's a single-row
> > statement or not.

It's not that obvious to me that it's going to be beneficial to go down
a planned path in all that many cases. If all that the RI check does is
an index_rescan() && index_getnext_slot(), there are not that many
realistic types of plans that are going to be better.

IIUC a query to check a transition table would, simplified, boil down to
either:

SELECT * FROM transition_table tt
WHERE
  -- for an insertion/update into the referencing table
  (
    NOT EXISTS (SELECT * FROM referenced_table rt
                WHERE rt.referenced_column = tt.referencing_column)
    [AND ... , ]
  )
  -- for update / delete of referenced table
  OR EXISTS (SELECT * FROM referencing_table rt1
             WHERE rt1.referencing_column = tt.referenced_column1)
  [OR ... , ]
LIMIT 1;

Where a returned row would signal an error. But it would need to handle
row locking, CASCADE/SET NULL/SET DEFAULT etc. More on that below.

While it's tempting to want to write the latter check as

-- for update / delete of referenced table
SELECT * FROM referencing_table rt
WHERE referencing_column IN (SELECT referenced_column FROM transition_table tt)
LIMIT 1;
that'd make it harder to know the violating row.


As the transition table isn't ordered it seems like there's not that
many realistic ways to execute this:

1) A nestloop semi/anti-join with an inner index scan
2) Sort transition table, do a merge semi/anti-join between sort and an
  ordered index scan on the referenced / referencing table(s).
3) Hash semi/anti-join, requiring a full table scan of the tables


I think 1) would be worse than just doing the indexscan manually. 2)
would probably be beneficial if there's a lot of rows on the inner side,
due to the ordered access and deduplication. 3) would sometimes be
beneficial because it'd avoid an index scan for each tuple in the
transition table.

The cases in which it is clear to me that a bulk check could
theoretically be significantly better than a fast per-row check are:

1) There's a *lot* of rows in the transition table relative to comparatively
  small referenced / referencing tables. As those tables can cheaply be
  hashed, a hashjoin will be more efficient than doing an index lookup
  for each transition table row.
2) There's a lot of duplicate content in the transition
  table. E.g. because there's a lot of references to the same row.

Did I forget important ones?


With regard to the row locking etc. that I elided above: I think that
actually will prevent most if not all interesting optimizations: because
of the FOR KEY SHARE that's required, the planner's plan will pretty much
always degrade to a per-row subplan check anyway.  Is there any
formulation of the necessary queries that doesn't suffer from this
problem?


> That seems hard to do in advance ... but it would be easy to code
> a statement-level AFTER trigger along the lines of
>
>   if (transition table contains one row)
>   // fast special case here
>   else
>   // slow general case here.

I suspect we'd need something more complicated than this for it to be
beneficial. My gut feeling would be that the case where a transition
table style check would be most commonly beneficial is when you have a
very small referenced table, and a *lot* of rows get inserted.  But
clearly we wouldn't want to have bulk insertion suddenly also store all
rows in a transition table.

Nor would we want to have a bulk UPDATE cause all the updated rows to be
stored in the transition table, even though none of the relevant columns
changed (i.e. the RI_FKey_[f]pk_upd_check_required logic in
AfterTriggerSaveEvent()).


I still don't quite see how shunting RI checks through triggers saves us
more than it costs:

Especially for the stuff we do as AFTER: Most of the time we could do
the work we defer till query end immediately, rather than building up an
in-memory queue. Besides saving memory, in a lot of cases that'd also
make it unnecessary to refetch the row at a later time, possibly needing
to chase updated row versions.

But even for the BEFORE checks, largely going through generic trigger
code means it's much harder to batch checks without suddenly requiring
memory proportional to the number of inserted rows.


There obviously are cases where it's not possible to check just after
each row: deferrable constraints, as well as CASCADE / SET NULL /
SET DEFAULT on tables with user-defined triggers, for example. But it'd
likely be sensible to handle that in the way we already handle deferred
uniqueness checks, i.e. we only queue something if there's a potential

Re: Poll: are people okay with function/operator table redesign?

2020-04-28 Thread Tom Lane
I wrote:
> One thing I couldn't help noticing while fooling with this is what
> seems to be a bug in the PDF toolchain: any place you try to put
> an <indexterm>, you get extra whitespace.  Not a lot, but once you
> see it, you can't un-see it.  It's particularly obvious if one of
> two adjacent lines has the extra indentation and the other doesn't.
> ...
> The only "fix" I've found is to place the  at the end
> of the signature  instead of the beginning.

I spent some more time experimenting with this today, and determined
that there's no way to fix it by messing with FO layout attributes.
The basic problem seems to be that if you write

   

 ceiling

ceiling ( numeric )

then what you get in the .fo file is



ceiling ( numeric )

where the element generated for the indexterm apparently is used as a
cross-reference anchor.  The trouble with this is that the rules for
collapsing adjacent whitespace don't work across that element, so no matter
what you do you will end up with two spaces not one before the visible text
"ceiling".  The only way to hide the effects of that with layout attributes
is to set whitespace to be ignored altogether within the block, which is
quite undesirable.

The fix I'm currently considering is to eliminate the extra whitespace
run(s) by formatting <indexterm>s within tables this way:

  
   
 char_length

 character string
 length

 length
 of a character string
 character string, length

char_length ( text )
integer
   

Perhaps it's only worth being anal about this in table cells with multiple
function signatures and/or multiple <indexterm>s; in other places the
whitespace variation just isn't that noticeable.  On the other hand,
there's something to be said for having uniform layout of the XML source,
which'd suggest having a uniform rule "no whitespace before an <indexterm>
within a table cell".

Or we could put the <indexterm>s at the end.  Or just ignore it, reasoning
that the PDF output is never going to look all that great anyway.

Thoughts?

regards, tom lane




Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-28 Thread Juan José Santamaría Flecha
On Tue, Apr 28, 2020 at 9:38 PM Ranier Vilela  wrote:

> "pt" means portuguese language.
> "pt_BR" means portuguese language from Brazil, "divisão por zero", is
> correct.
> "pt_PT" means portuguese language from Portugal, "division by zero"?
> poderia ser "divisão por zero", too.
>
> Why "pt_PT" do not is translated?
>

The translation files are generated as 'pt_BR.po', so this is the expected
behaviour.

With my limited knowledge of Portuguese, it makes little sense to have a
localized version.

Regards,

Juan José Santamaría Flecha


Re: Proposing WITH ITERATIVE

2020-04-28 Thread Jonah H. Harris
On Mon, Apr 27, 2020 at 8:50 PM Jeff Davis  wrote:

> Can you illustrate with some examples? I get that one is appending and
> the other is modifying in-place, but how does this end up looking in
> the query language?
>

To ensure credit is given to the original authors, the initial patch I'm
working with (against Postgres 11 and 12) came from Denis Hirn, Torsten
Grust, and Christian Duta. Attached is a quick-and-dirty merge of that
patch that applies cleanly against 13-devel. It is not solid, at all, but
demonstrates the functionality. At present, my updates can be found at
https://github.com/jonahharris/postgres/tree/with-iterative

As the patch makes use of additional booleans for iteration, if there's
interest in incorporating this functionality, I'd like to discuss with Tom,
Robert, et al regarding the current use of booleans for CTE recursion
differentiation in parsing and planning. In terms of making it
production-ready, I think the cleanest method would be to use a bitfield
(rather than booleans) to differentiate recursive and iterative CTEs.
Though, as that would touch more code, it's obviously up for discussion.

I'm working on a few good standalone examples of PageRank, shortest path,
etc. But, the simplest Fibonacci example can be found here:

EXPLAIN ANALYZE
WITH RECURSIVE fib_sum (iteration, previous_number, new_number)
  AS (SELECT 1, 0::numeric, 1::numeric
   UNION ALL
  SELECT (iteration + 1), new_number, (previous_number + new_number)
FROM fib_sum
   WHERE iteration <= 1)
SELECT r.iteration, r.new_number
  FROM fib_sum r
 ORDER BY 1 DESC
 LIMIT 1;

QUERY PLAN

--
 Limit  (cost=3.81..3.81 rows=1 width=36) (actual time=44.418..44.418
rows=1 loops=1)
   CTE fib_sum
 ->  Recursive Union  (cost=0.00..3.03 rows=31 width=68) (actual
time=0.005..14.002 rows=10001 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=68) (actual
time=0.004..0.004 rows=1 loops=1)
   ->  WorkTable Scan on fib_sum  (cost=0.00..0.24 rows=3 width=68)
(actual time=0.001..0.001 rows=1 loops=10001)
 Filter: (iteration <= 1)
 Rows Removed by Filter: 0
   ->  Sort  (cost=0.78..0.85 rows=31 width=36) (actual time=44.417..44.417
rows=1 loops=1)
 Sort Key: r.iteration DESC
 Sort Method: top-N heapsort  Memory: 27kB
 ->  CTE Scan on fib_sum r  (cost=0.00..0.62 rows=31 width=36)
(actual time=0.009..42.660 rows=10001 loops=1)
 Planning Time: 0.331 ms
 Execution Time: 45.949 ms
(13 rows)

-- No ORDER/LIMIT is required with ITERATIVE as only a single tuple is
present
EXPLAIN ANALYZE
WITH ITERATIVE fib_sum (iteration, previous_number, new_number)
  AS (SELECT 1, 0::numeric, 1::numeric
   UNION ALL
  SELECT (iteration + 1), new_number, (previous_number + new_number)
FROM fib_sum
   WHERE iteration <= 1)
SELECT r.iteration, r.new_number
  FROM fib_sum r;

QUERY PLAN

--
 CTE Scan on fib_sum r  (cost=3.03..3.65 rows=31 width=36) (actual
time=11.626..11.627 rows=1 loops=1)
   CTE fib_sum
 ->  Recursive Union  (cost=0.00..3.03 rows=31 width=68) (actual
time=11.621..11.621 rows=1 loops=1)
   ->  Result  (cost=0.00..0.01 rows=1 width=68) (actual
time=0.001..0.001 rows=1 loops=1)
   ->  WorkTable Scan on fib_sum  (cost=0.00..0.24 rows=3 width=68)
(actual time=0.001..0.001 rows=1 loops=10001)
 Filter: (iteration <= 1)
 Rows Removed by Filter: 0
 Planning Time: 0.068 ms
 Execution Time: 11.651 ms
(9 rows)


-- 
Jonah H. Harris


with_iterative_v1.patch
Description: Binary data


Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-28 Thread James Coleman
On Tue, Apr 28, 2020 at 1:40 PM Pavel Stehule  wrote:
>
>
>
> út 28. 4. 2020 v 18:17 odesílatel James Coleman  napsal:
>>
>> On Tue, Apr 28, 2020 at 11:18 AM Pavel Stehule  
>> wrote:
>> >
>> >
>> >
>> > út 28. 4. 2020 v 16:48 odesílatel Tomas Vondra 
>> >  napsal:
>> >>
>> >> On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote:
>> >> >út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra 
>> >> >
>> >> >napsal:
>> >> >
>> >> >> ...
>> >> >>
>> >> >> >I'm not so concerned about this in any query where we have a real FROM
>> >> >> >clause because even if we end up with only one row, the relative
>> >> >> >penalty is low, and the potential gain is very high. But simple
>> >> >> >expressions in pl/pgsql, for example, are a case where we can know for
>> >> >> >certain (correct me if I've wrong on this) that we'll only execute the
>> >> >> >expression once, which means there's probably always a penalty for
>> >> >> >choosing the implementation with setup costs over the default linear
>> >> >> >scan through the array.
>> >> >> >
>> >> >>
>> >> >> What do you mean by "simple expressions"? I'm not plpgsql expert and I
>> >> >> see it mostly as a way to glue together SQL queries, but yeah - if we
>> >> >> know a given ScalarArrayOpExpr will only be executed once, then we can
>> >> >> disable this optimization for now.
>> >> >>
>> >> >
>> >> >a := a + 1
>> >> >
>> >> >is translated to
>> >> >
>> >> >SELECT $1 + 1 and save result to var a
>> >> >
>> >> >The queries like this "SELECT $1 + 1" are simple expressions. They are
>> >> >evaluated just on executor level - it skip SPI
>> >> >
>> >> >the simple expression has not FROM clause, and have to return just one 
>> >> >row.
>> >> >I am not sure if it is required, it has to return just one column.
>>
>> Yes, this is what I meant by simple expressions.
>>
>> >> >I am not sure if executor knows so expression is executed as simply
>> >> >expressions. But probably it can be deduced from context
>> >> >
>> >>
>> >> Not sure. The executor state is created by exec_eval_simple_expr, which
>> >> calls ExecInitExprWithParams (and it's the only caller). And that in
>> >> turn is the only place that leaves (state->parent == NULL). So maybe
>> >> that's a way to identify simple (standalone) expressions? Otherwise we
>> >> might add a new EEO_FLAG_* to identify these expressions explicitly.
>>
>> I'll look into doing one of these.
>>
>> >> I wonder if it would be possible to identify cases when the expression
>> >> is executed in a loop, e.g. like this:
>> >>
>> >>  FOR i IN 1..1000 LOOP
>> >>  x := y IN (1, 2, ..., 999);
>> >>  END LOOP;
>> >>
>> >> in which case we only build the ScalarArrayOpExpr once, so maybe we
>> >> could keep the hash table for all executions. But maybe that's not
>> >> possible or maybe it's pointless for other reasons. It sure looks a bit
>> >> like trying to build a query engine from FOR loop.
>> >
>> >
>> > Theoretically it is possible, not now. But I don't think so it is 
>> > necessary. I cannot to remember this pattern in any plpgsql code and I 
>> > never seen any request on this feature.
>> >
>> > I don't think so this is task for plpgsql engine. Maybe for JIT sometimes.
>>
>> Agreed. I'd thought about this kind of scenario when I brought this
>> up, but I think solving it would the responsibility of the pg/pgsql
>> compiler rather than the expression evaluation code, because it'd have
>> to recognize the situation and setup a shared expression evaluation
>> context to be reused each time through the loop.
>
>
> can be nice if new implementation was not slower then older in all 
> environments and context (including plpgsql expressions)

Agreed, which is why I'm going to look into preventing using the new
code path for simple expressions.
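
To sketch the kind of guard I have in mind (purely illustrative names; the
real check would live wherever the hashed ScalarArrayOpExpr step gets set
up, and whether to key it off state->parent or a new EEO_FLAG_* is still
open):

	/*
	 * Hypothetical guard during expression initialization: a standalone
	 * expression (state->parent == NULL), e.g. a plpgsql "simple
	 * expression", is evaluated only once, so building a hash table for
	 * the ScalarArrayOpExpr would never pay off.  Fall back to the plain
	 * linear search over the array in that case.
	 */
	if (state->parent == NULL)
		use_hashed_saop = false;	/* illustrative variable name */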

James




Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-28 Thread Juan José Santamaría Flecha
On Tue, Apr 28, 2020 at 5:16 PM davinder singh 
wrote:

> On Mon, Apr 27, 2020 at 4:50 PM Amit Kapila 
> wrote:
>
>> Bemba_Zambia
>> Bena_Tanzania
>> Bulgarian_Bulgaria
>> Swedish_Sweden.1252
>> Swedish_Sweden
>>
>
> I have tested with different locales with codepages including above. There
> are few which return different locale code but the error messages in both
> the cases are the same. I have attached the test and log files.
>
> But there was one case, where locale code and error messages both are
> different.
> Portuguese_Brazil.1252
>
> log from [1]
> 2020-04-28 14:27:39.785 GMT [2284] DEBUG:  IsoLocaleName() executed;
> locale: "pt"
> 2020-04-28 14:27:39.787 GMT [2284] ERROR:  division by zero
> 2020-04-28 14:27:39.787 GMT [2284] STATEMENT:  Select 1/0;
>
> log from [2]
> 2020-04-28 14:36:20.666 GMT [14608] DEBUG:  IsoLocaleName() executed;
> locale: "pt_BR"
> 2020-04-28 14:36:20.673 GMT [14608] ERRO:  divisão por zero
> 2020-04-28 14:36:20.673 GMT [14608] COMANDO:  Select 1/0;
>

AFAICT, the good result is coming from the new logic.


Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-28 Thread Juan José Santamaría Flecha
On Mon, Apr 27, 2020 at 1:20 PM Amit Kapila  wrote:

> I think we should backpatch this till 9.5 as I could see the changes
> made by commit 0fb54de9 to support MSVC2015 are present in that branch
> and the same is mentioned in the commit message.  Would you like to
> prepare patches (and test those) for back-branches?
>

I do not have the means to test these patches using Visual Studio versions
prior to 2012, but please find attached patches for 9.5-9.6 and 10-11-12 as
of version 14. The extension is 'txt' so as not to break the cfbot.


> I have made few cosmetic changes in the attached patch which includes
> adding/editing a few comments, ran pgindent, etc.  I have replaced the
> reference of "IETF-standardized" with "Unix-style" as we are already
> using it at other places in the comments as well.


LGTM.

Regards,
Juan José Santamaría Flecha

>
>
diff --git a/src/backend/utils/adt/pg_locale.c 
b/src/backend/utils/adt/pg_locale.c
index b9e01b6..10eb0ab 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -220,6 +220,7 @@ pg_perm_setlocale(int category, const char *locale)
result = IsoLocaleName(locale);
if (result == NULL)
result = (char *) locale;
+   elog(DEBUG3, "IsoLocaleName() executed; locale: 
\"%s\"", result);
 #endif /* WIN32 */
break;
 #endif /* LC_MESSAGES */
@@ -953,25 +954,106 @@ cache_locale_time(void)
  * string.  Furthermore, msvcr110.dll changed the undocumented _locale_t
  * content to carry locale names instead of locale identifiers.
  *
- * MinGW headers declare _create_locale(), but msvcrt.dll lacks that symbol.
- * IsoLocaleName() always fails in a MinGW-built postgres.exe, so only
- * Unix-style values of the lc_messages GUC can elicit localized messages.  In
- * particular, every lc_messages setting that initdb can select automatically
- * will yield only C-locale messages.  XXX This could be fixed by running the
- * fully-qualified locale name through a lookup table.
+ * Visual Studio 2015 should still be able to do the same as Visual Studio
+ * 2012, but the declaration of locale_name is missing in _locale_t, causing
+ * this code compilation to fail, hence this falls back instead on to
+ * enumerating all system locales by using EnumSystemLocalesEx to find the
+ * required loacle name.  If the input argument is in Unix-style then we can
+ * get ISO Locale name directly by using GetLocaleInfoEx() with LCType as
+ * LOCALE_SNAME.
+ *
+ * MinGW headers declare _create_locale(), but msvcrt.dll lacks that symbol in
+ * releases before Windows 8. IsoLocaleName() always fails in a MinGW-built
+ * postgres.exe, so only Unix-style values of the lc_messages GUC can elicit
+ * localized messages. In particular, every lc_messages setting that initdb
+ * can select automatically will yield only C-locale messages. XXX This could
+ * be fixed by running the fully-qualified locale name through a lookup table.
  *
  * This function returns a pointer to a static buffer bearing the converted
  * name or NULL if conversion fails.
  *
- * [1] http://msdn.microsoft.com/en-us/library/windows/desktop/dd373763.aspx
- * [2] http://msdn.microsoft.com/en-us/library/windows/desktop/dd373814.aspx
+ * [1] https://docs.microsoft.com/en-us/windows/win32/intl/locale-identifiers
+ * [2] https://docs.microsoft.com/en-us/windows/win32/intl/locale-names
  */
+
+#if _MSC_VER >= 1900
+/*
+ * Callback function for EnumSystemLocalesEx() in IsoLocaleName().
+ *
+ * This search is done by enumerating all system locales, trying to match a
+ * locale with the format: [_], e.g. English[_United States]
+ *
+ * The input is a three wchar_t array as an LPARAM. The first element is the
+ * locale_name we want to match, the second element is an allocated buffer
+ * where the Unix-style locale is copied if a match is found, and the third
+ * element is the search status, 1 if a match was found, 0 otherwise.
+ */
+static BOOL CALLBACK
+search_locale_enum(LPWSTR pStr, DWORD dwFlags, LPARAM lparam)
+{
+   wchar_t test_locale[LOCALE_NAME_MAX_LENGTH];
+   wchar_t   **argv;
+
+   (void) (dwFlags);
+
+   argv = (wchar_t **) lparam;
+   *argv[2] = (wchar_t) 0;
+
+   memset(test_locale, 0, sizeof(test_locale));
+
+   /* Get the name of the <Language> in English */
+   if (GetLocaleInfoEx(pStr, LOCALE_SENGLISHLANGUAGENAME,
+   test_locale, LOCALE_NAME_MAX_LENGTH))
+   {
+   /*
+* If the enumerated locale does not have a hyphen ("en") OR the
+* lc_message input does not have an underscore ("English"), we only
+* need to compare the <Language> tags.
+*/
+   if (wcsrchr(pStr, '-') == NULL || wcsrchr(argv[0], '_') == NULL)
+   {
+  

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-28 Thread Pavel Stehule
út 28. 4. 2020 v 18:17 odesílatel James Coleman  napsal:

> On Tue, Apr 28, 2020 at 11:18 AM Pavel Stehule 
> wrote:
> >
> >
> >
> > út 28. 4. 2020 v 16:48 odesílatel Tomas Vondra <
> tomas.von...@2ndquadrant.com> napsal:
> >>
> >> On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote:
> >> >út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra <
> tomas.von...@2ndquadrant.com>
> >> >napsal:
> >> >
> >> >> ...
> >> >>
> >> >> >I'm not so concerned about this in any query where we have a real
> FROM
> >> >> >clause because even if we end up with only one row, the relative
> >> >> >penalty is low, and the potential gain is very high. But simple
> >> >> >expressions in pl/pgsql, for example, are a case where we can know
> for
> >> >> >certain (correct me if I've wrong on this) that we'll only execute
> the
> >> >> >expression once, which means there's probably always a penalty for
> >> >> >choosing the implementation with setup costs over the default linear
> >> >> >scan through the array.
> >> >> >
> >> >>
> >> >> What do you mean by "simple expressions"? I'm not plpgsql expert and
> I
> >> >> see it mostly as a way to glue together SQL queries, but yeah - if we
> >> >> know a given ScalarArrayOpExpr will only be executed once, then we
> can
> >> >> disable this optimization for now.
> >> >>
> >> >
> >> >a := a + 1
> >> >
> >> >is translated to
> >> >
> >> >SELECT $1 + 1 and save result to var a
> >> >
> >> >The queries like this "SELECT $1 + 1" are simple expressions. They are
> >> >evaluated just on executor level - it skip SPI
> >> >
> >> >the simple expression has not FROM clause, and have to return just one
> row.
> >> >I am not sure if it is required, it has to return just one column.
>
> Yes, this is what I meant by simple expressions.
>
> >> >I am not sure if executor knows so expression is executed as simply
> >> >expressions. But probably it can be deduced from context
> >> >
> >>
> >> Not sure. The executor state is created by exec_eval_simple_expr, which
> >> calls ExecInitExprWithParams (and it's the only caller). And that in
> >> turn is the only place that leaves (state->parent == NULL). So maybe
> >> that's a way to identify simple (standalone) expressions? Otherwise we
> >> might add a new EEO_FLAG_* to identify these expressions explicitly.
>
> I'll look into doing one of these.
>
> >> I wonder if it would be possible to identify cases when the expression
> >> is executed in a loop, e.g. like this:
> >>
> >>  FOR i IN 1..1000 LOOP
> >>  x := y IN (1, 2, ..., 999);
> >>  END LOOP;
> >>
> >> in which case we only build the ScalarArrayOpExpr once, so maybe we
> >> could keep the hash table for all executions. But maybe that's not
> >> possible or maybe it's pointless for other reasons. It sure looks a bit
> >> like trying to build a query engine from FOR loop.
> >
> >
> > Theoretically it is possible, not now. But I don't think so it is
> necessary. I cannot to remember this pattern in any plpgsql code and I
> never seen any request on this feature.
> >
> > I don't think so this is task for plpgsql engine. Maybe for JIT
> sometimes.
>
> Agreed. I'd thought about this kind of scenario when I brought this
> up, but I think solving it would the responsibility of the pg/pgsql
> compiler rather than the expression evaluation code, because it'd have
> to recognize the situation and setup a shared expression evaluation
> context to be reused each time through the loop.
>

It would be nice if the new implementation were not slower than the old one
in any environment or context (including plpgsql expressions).

Regards

Pavel


> James
>


Re: Bogus documentation for bogus geometric operators

2020-04-28 Thread Tom Lane
Emre Hasegeli  writes:
>> Perhaps it's too late in the v13 cycle to actually do anything
>> about this code-wise, but what should I do documentation-wise?
>> I'm certainly not eager to document that these operators behave
>> inconsistently depending on which type you're talking about.

> I don't think we need to worry too much about doing something in the
> v13 cycle.  The geometric operators had and evidently still have so
> many bugs.  Nobody complains about them other than the developers who
> read the code.

Yeah, I ended up just documenting the current state of affairs.

> I am happy to prepare a patch for the next release to fix the current
> operators and add the missing ones.

Sounds great!

regards, tom lane




Re: Bogus documentation for bogus geometric operators

2020-04-28 Thread Emre Hasegeli
> Perhaps it's too late in the v13 cycle to actually do anything
> about this code-wise, but what should I do documentation-wise?
> I'm certainly not eager to document that these operators behave
> inconsistently depending on which type you're talking about.

I don't think we need to worry too much about doing something in the
v13 cycle.  The geometric operators had and evidently still have so
many bugs.  Nobody complains about them other than the developers who
read the code.

I am happy to prepare a patch for the next release to fix the current
operators and add the missing ones.




Re: [HACKERS] Restricting maximum keep segments by repslots

2020-04-28 Thread Alvaro Herrera
On 2020-Apr-28, Kyotaro Horiguchi wrote:

> At Mon, 27 Apr 2020 18:33:42 -0400, Alvaro Herrera  
> wrote in 
> > On 2020-Apr-08, Kyotaro Horiguchi wrote:
> > 
> > > At Wed, 08 Apr 2020 09:37:10 +0900 (JST), Kyotaro Horiguchi 
> > >  wrote in 

> > Thanks for the fix!  I propose two changes:
> > 
> > 1. reword the error like this:
> > 
> > ERROR:  replication slot "regression_slot3" cannot be advanced
> > DETAIL:  This slot has never previously reserved WAL, or has been 
> > invalidated
> 
> Agreed to describe what failed rather than the cause.  However,
> logical replication slots are always "previously reserved" at
> creation.

Bah, of course.  I was thinking of making the equivalent messages all
identical in all callsites, but maybe they should be different when
slots are logical.  I'll go over them again.

> > 2. use the same error in one other place, to wit
> >pg_logical_slot_get_changes() and pg_replication_slot_advance().  I
> >made the DETAIL part the same in all places, but the ERROR line is
> >adjusted to what each callsite is doing.
> >I do think that this change in test_decoding is a bit unpleasant:
> > 
> > -ERROR:  cannot use physical replication slot for logical decoding
> > +ERROR:  cannot get changes from replication slot "repl"
> > 
> >The test is
> >   -- check that we're detecting a streaming rep slot used for logical 
> > decoding
> >   SELECT 'init' FROM pg_create_physical_replication_slot('repl');
> >   SELECT data FROM pg_logical_slot_get_changes('repl', NULL, NULL, 
> > 'include-xids', '0', 'skip-empty-xacts', '1');
> 
> The message may be understood as "No change has been made since
> restart_lsn". Does something like the following work?
> 
> ERROR:  replication slot "repl" is not usable to get changes

That wording seems okay, but my specific point for this error message is
that we were trying to use a physical slot to get logical changes; so
the fact that the slot has been invalidated is secondary and we should
complain about the *type* of slot rather than the restart_lsn.


> By the way there are some other messages that don't render the
> symptom but the cause.
> 
> "cannot use physical replication slot for logical decoding"
> "replication slot \"%s\" was not created in this database"
> 
> Don't they need the same amendment?

Maybe, but I don't want to start rewording every single message in uses
of replication slots ... I prefer to only modify the ones related to the
problem at hand.

> > > > On the other hand, physical replication doesn't break by invalidation.
> > > > [...]

> > Anyway I think the current patch can be applied as is -- and if we want
> > physical replication to have some other behavior, we can patch for that
> > afterwards.
> 
> Agreed here. The false-invalidation doesn't lead to any serious
> consequences.

But does it?  What happens, for example, if we have a slot used to get a
pg_basebackup, then time passes before starting to stream from it and the
slot is invalidated?  I think this "works fine" (meaning that once we try to
stream from the slot to replay at the restored base backup, we will
raise an error immediately), but I haven't tried.

The worst situation would be producing a corrupt replica.  I don't think
this is possible.

The ideal behavior I think would be that pg_basebackup aborts
immediately when the slot is invalidated, to avoid wasting more time
producing a doomed backup.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: PostgreSQL CHARACTER VARYING vs CHARACTER VARYING (Length)

2020-04-28 Thread Paul Carlucci
In PG, the text, character varying, character varying(length), and character
column types are all essentially the same thing, with each successive type
inheriting the properties of its parent and adding a bit more metadata;
they're all basically just "text".  If you're coming from other database
engines, or from general programming languages where text and fixed-length
string fields are handled differently, that can seem a bit different from
what you're used to.  Heck, I can think of one engine where, if you have a
text column, you have to query the table for the blob identifier and then
issue a separate call to retrieve it.  Here in PG it's literally all the
same: handled the same, performs the same.  Use whatever limits make sense
for your application.
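
For instance, a quick throwaway illustration (made-up table names) of the
only user-visible difference, the length check:

  create table t_plain   (name varchar);
  create table t_limited (name varchar(5));

  insert into t_plain   values ('abcde');
  insert into t_limited values ('abcde');

  -- same on-disk size either way
  select pg_column_size(name) from t_plain
  union all
  select pg_column_size(name) from t_limited;

  -- the only difference is the length check
  insert into t_limited values (repeat('x', 10));
  -- ERROR:  value too long for type character varying(5)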

On Tue, Apr 28, 2020 at 5:22 AM Rajin Raj  wrote:

> Is there any impact of using the character varying without providing the
> length while creating tables?
> I have created two tables and inserted 1M records. But I don't see any
> difference in pg_class. (size, relpage)
>
> create table test_1(name varchar);
> create table test_2(name varchar(50));
>
> insert into test_1 ... 10M records
> insert into test_2 ... 10M records
>
> vacuum (full,analyze) db_size_test_1;
> vacuum (full,analyze) db_size_test_2;
>
> Which option is recommended?
>
> *Regards,*
> *Rajin *
>


Re: Fixes for two separate bugs in nbtree VACUUM's page deletion

2020-04-28 Thread Peter Geoghegan
On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada
 wrote:
> I agree with both patches.

Thanks for the review.

> For the first fix it seems better to push down the logic to the page
> deletion code as your 0001 patch does so. The following change changes
> the page deletion code so that it emits LOG message indicating the
> index corruption when a deleted page is passed. But we used to ignore
> in this case.

> I agree with this change but since I'm not sure this change directly
> relates to this bug I wonder if it can be a separate patch.

I am not surprised that you are concerned about this. That was the
hardest decision I had to make when writing the patch.

The central idea in the 0001-* patch (as I say at the end of the
commit message) is that we have a strict rule about where we handle
maintaining the oldest btpo.xact for already-deleted pages, and where
we handle it for pages that become deleted. The rule is this:
btvacuumpage() is strictly responsible for the former category (as
well as totally live pages), while _bt_pagedel() is responsible for
the latter category (this includes pages that started out as half-dead
but become deleted).

In order for this to work, _bt_pagedel() must not ever see a deleted
page that it did not delete itself. If I didn't explicitly take a
strict position on this, then I suppose that I'd have to have similar
handling for maintaining the oldest btpo.xact for existing deleted
pages encountered within _bt_pagedel(). But that doesn't make any
sense, even with corruption, so I took a strict position. Even with
corruption, how could we encounter an *existing* deleted page in
_bt_pagedel() but not encounter the same page within some call to
btvacuumpage() made from btvacuumscan()? In other words, how can we
fail to maintain the oldest btpo.xact for deleted pages even if we
assume that the index has a corrupt page with sibling links that point
to a fully deleted page? It seems impossible, even in this extreme,
contrived scenario.

Then there is the separate question of the new log message about
corruption. It's not too hard to see why _bt_pagedel() cannot ever
access an existing deleted page unless there is corruption:

* _bt_pagedel()'s only caller (the btvacuumpage() caller) will simply
never pass a deleted page -- it handles those directly.

* _bt_pagedel() can only access additional leaf pages to consider
deleting them by following the right sibling link of the
btvacuumpage() caller's leaf page (or possibly some page even further
to the right if it ends up deleting many leaf pages in one go).
Following right links and finding a deleted page can only happen when
there is a concurrent page deletion by VACUUM, since the sibling links
to the deleted page are changed by the second stage of deletion (i.e.
by the atomic action where the page is actually marked deleted,
within _bt_unlink_halfdead_page()).

* There cannot be a concurrent VACUUM, though, since _bt_pagedel()
runs in VACUUM. (Note that we already depend on this for correctness
in _bt_unlink_halfdead_page(), which has a comment that says "a
half-dead page cannot resurrect because there can be only one vacuum
process running at a time".)

The strict position seems justified. It really isn't that strict. I'm
not concerned about the new LOG message concerning corruption being
annoying or wrong.

-- 
Peter Geoghegan




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-28 Thread James Coleman
On Tue, Apr 28, 2020 at 11:18 AM Pavel Stehule  wrote:
>
>
>
> út 28. 4. 2020 v 16:48 odesílatel Tomas Vondra  
> napsal:
>>
>> On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote:
>> >út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra 
>> >
>> >napsal:
>> >
>> >> ...
>> >>
>> >> >I'm not so concerned about this in any query where we have a real FROM
>> >> >clause because even if we end up with only one row, the relative
>> >> >penalty is low, and the potential gain is very high. But simple
>> >> >expressions in pl/pgsql, for example, are a case where we can know for
>> >> >certain (correct me if I've wrong on this) that we'll only execute the
>> >> >expression once, which means there's probably always a penalty for
>> >> >choosing the implementation with setup costs over the default linear
>> >> >scan through the array.
>> >> >
>> >>
>> >> What do you mean by "simple expressions"? I'm not plpgsql expert and I
>> >> see it mostly as a way to glue together SQL queries, but yeah - if we
>> >> know a given ScalarArrayOpExpr will only be executed once, then we can
>> >> disable this optimization for now.
>> >>
>> >
>> >a := a + 1
>> >
>> >is translated to
>> >
>> >SELECT $1 + 1 and save result to var a
>> >
>> >The queries like this "SELECT $1 + 1" are simple expressions. They are
>> >evaluated just on executor level - it skip SPI
>> >
>> >the simple expression has not FROM clause, and have to return just one row.
>> >I am not sure if it is required, it has to return just one column.

Yes, this is what I meant by simple expressions.
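
To make that concrete, a trivial (contrived) illustration of the difference:

  DO $$
  DECLARE
    x int := 0;
  BEGIN
    -- simple expression: no FROM clause, single column, single row;
    -- evaluated directly at the executor level, skipping SPI
    x := x + 1;

    -- not a simple expression: it has a FROM clause, so it goes through SPI
    SELECT count(*) INTO x FROM pg_class;
  END;
  $$;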

>> >I am not sure if executor knows so expression is executed as simply
>> >expressions. But probably it can be deduced from context
>> >
>>
>> Not sure. The executor state is created by exec_eval_simple_expr, which
>> calls ExecInitExprWithParams (and it's the only caller). And that in
>> turn is the only place that leaves (state->parent == NULL). So maybe
>> that's a way to identify simple (standalone) expressions? Otherwise we
>> might add a new EEO_FLAG_* to identify these expressions explicitly.

I'll look into doing one of these.

>> I wonder if it would be possible to identify cases when the expression
>> is executed in a loop, e.g. like this:
>>
>>  FOR i IN 1..1000 LOOP
>>  x := y IN (1, 2, ..., 999);
>>  END LOOP;
>>
>> in which case we only build the ScalarArrayOpExpr once, so maybe we
>> could keep the hash table for all executions. But maybe that's not
>> possible or maybe it's pointless for other reasons. It sure looks a bit
>> like trying to build a query engine from FOR loop.
>
>
> Theoretically it is possible, not now. But I don't think so it is necessary. 
> I cannot to remember this pattern in any plpgsql code and I never seen any 
> request on this feature.
>
> I don't think so this is task for plpgsql engine. Maybe for JIT sometimes.

Agreed. I'd thought about this kind of scenario when I brought this
up, but I think solving it would the responsibility of the pg/pgsql
compiler rather than the expression evaluation code, because it'd have
to recognize the situation and setup a shared expression evaluation
context to be reused each time through the loop.

James




Re: pg_rewind docs correction

2020-04-28 Thread James Coleman
On Tue, Apr 28, 2020 at 12:31 AM Michael Paquier  wrote:
>
> On Mon, Mar 09, 2020 at 09:26:17AM -0400, James Coleman wrote:
> >> -  pg_stat_tmp/, and
> >> -  pg_subtrans/ are omitted from the data copied
> >> -  from the source cluster. Any file or directory beginning with
> >> -  pgsql_tmp is omitted, as well as are
> >> +  pg_stat_tmp/, and
> >> pg_subtrans/
> >> +  are omitted from the data copied from the source cluster. The files
> >>
> >> This is just reorganizing an existing list, why?
> >>
> >
> > The grammar seemed a bit awkward to me, so while I was already reworking
> > this paragraph I tried to clean that up a bit.
>
> Thanks for the new patch, and sorry for the delay.
>
> Okay, I saw what you were coming at here, with one sentence for
> directories, and one for files.
>
> > Still ongoing, correct? I guess I mentally think of them as being only one
> > month, but I guess that's not actually true. Regardless I'm not sure what
> > policy is for patches that have been in flight in hackers for a while but
> > just missed being added to the CF app.
>
> This is a documentation patch, so improving this part of the docs now
> is fine by me, particularly as this is an improvement.  Here are more
> notes from me:
> - I have removed the "As with a base backup" at the beginning of the
> second paragraph you modified.  The first paragraph modified already
> references a base backup, so one reference is enough IMO.
> - WAL replay does not happen from the WAL position where WAL diverged,
> but from the last checkpoint before WAL diverged.
> - Did some tweaks about the new part for configuration files, as it
> may actually not be necessary to update the configuration for recovery
> to complete (depending on the settings of the source, the target may
> just require the creation of a standby.signal file in its data
> directory particularly with a common archive location for multiple
> clusters).
> - Some word-smithing in the step-by-step description.
>
> Is the updated version fine for you?

In your revised patched the follow paragraph:

+   
+As pg_rewind copies configuration files
+entirely from the source, it may be required to correct the configuration
+used for recovery before restarting the target server, especially the
+the target is reintroduced as a standby of the source. If you restart
+the server after the rewind operation has finished but without configuring
+recovery, the target may again diverge from the primary.
+   

I think is missing a word. Instead of "especially the the target"
should be "especially if the target".

In this block:

+  Create a backup_label file to begin WAL replay at
+  the checkpoint created at failover and configure the
+  pg_control file with a minimum consistency LSN
+  defined as the result of pg_current_wal_insert_lsn()
+  when rewinding from a live source and using the last checkpoint LSN
+  when rewinding from a stopped source.
+ 

Perhaps change "and using the last checkpoint LSN" to "or the last
checkpoint LSN". Alternatively you could make the grammar parallel by
changing to "and defined as the last checkpoint LSN", but that seems
wordy, and the "defined as [item or item]" is already a good grammar
construction.

Other than those two small things, your proposed revision looks good to me.

Thanks,
James




Re: Proposing WITH ITERATIVE

2020-04-28 Thread Jonah H. Harris
On Tue, Apr 28, 2020 at 11:51 AM Stephen Frost  wrote:

> Greetings Jonah!
>

Hey, Stephen. Hope you're doing well, man!


> One of the first question that we need to ask though, imv anyway, is- do
> the other databases use the WITH ITERATIVE syntax?  How many of them?
> Are there other approaches?  Has this been brought up to the SQL
> committee?
>

There are four that I'm aware of, but I'll put together a full list. I
don't believe it's been proposed to the standards committee, but I'll see
if I can find transcripts/proposals. I imagine those are all still public.

In general, we really prefer to avoid extending SQL beyond the standard,
> especially in ways that the standard is likely to be expanded.  In
> other words, we'd really prefer to avoid the risk that the SQL committee
> declares WITH ITERATIVE to mean something else in the future, causing us
> to have to have a breaking change to become compliant.


Agreed. That's my main concern as well.


> Now, if all the other major DB vendors have WITH ITERATIVE and they all
> work the same
> way, then we can have at least some confidence that the SQL committee
> will end up defining it in that way and there won't be any issue.
>

This is where it sucks, as each vendor has done it differently. One uses
WITH ITERATE, one uses WITH ITERATIVE, others use, what I consider to be, a
nasty operator-esque style... I think ITERATIVE makes the most sense, but
it does create a future contention as that definitely seems like the syntax
the committee would use as well.

Oracle ran into this issue as well, which is why they added an additional
clause to WITH that permits selection of depth/breadth-first search et al.
While it's kinda nasty, we could always similarly amend WITH RECURSIVE to
add an additional ITERATE or something to the tail of the with_clause rule.
But, that's why I'm looking for feedback :)

We do have someone on the committee who watches these lists, hopefully
> they'll have a chance to comment on this.  Perhaps it's already in
> discussion in the committee.
>

That would be awesome.

-- 
Jonah H. Harris


Re: Proposing WITH ITERATIVE

2020-04-28 Thread Jonah H. Harris
On Mon, Apr 27, 2020 at 8:50 PM Jeff Davis  wrote:

> Hi,
>

Hey, Jeff. Long time no talk. Good to see you're still on here.

You might get better feedback in a month or so; right now we just got
> into feature freeze.
>

Yep. No hurry. I've just been playing with this and wanted to start getting
feedback. It's a side-project for me anyway, so time is limited.


> Can you illustrate with some examples? I get that one is appending and
> the other is modifying in-place, but how does this end up looking in
> the query language?
>

I'm putting together a few concrete real-world examples.

> Rather than stopping when no new tuples are generated, WITH ITERATIVE
> > stops when a user-defined predicate evaluates to true.
>
> Why stop when it evaluates to true, and not false?
>

It's how they implemented it. A few other databases have implemented
similar functionality but, as it's not standard, it's kinda just up to each
implementor. I'm not married to that idea, but it has worked well for me so
far.

It seems like the benefit comes from carrying information along within
> tuples (by adding to scores or counters) rather than appending tuples.
> Is it possible to achieve this in other ways? The recursive CTE
> implementation is a very direct implementation of the standard, perhaps
> there are smarter approaches?
>

Yeah, in that specific case, one of the other implementations seems to
carry the counters along in the executor itself. But, as not all uses of
this functionality are iteration-count-based, I think that's a little
limiting. Using a terminator expression (of some kind) seems most
adaptable, I think. I'll give some examples of both types of cases.

-- 
Jonah H. Harris


Re: Proposing WITH ITERATIVE

2020-04-28 Thread Stephen Frost
Greetings Jonah!

* Jonah H. Harris (jonah.har...@gmail.com) wrote:
> On Tue, Apr 28, 2020 at 6:19 AM Andreas Karlsson  wrote:
> 
> > Do you have any examples of queries where it would help? It is pretty
> > hard to say how much value some new syntax adds without seeing how it
> > improves an intended use case.
> 
> Thanks for the response. I'm currently working on a few examples per Jeff's
> response along with some benchmark information including improvements in
> response time and memory usage of the current implementation. In the
> meantime, as this functionality has been added to a couple of other
> databases and there's academic research on it, if you're interested, here's
> a few papers with examples:
> 
> http://faculty.neu.edu.cn/cc/zhangyf/papers/2018-ICDCS2018-sqloop.pdf
> http://db.in.tum.de/~passing/publications/dm_in_hyper.pdf

Nice!

One of the first question that we need to ask though, imv anyway, is- do
the other databases use the WITH ITERATIVE syntax?  How many of them?
Are there other approaches?  Has this been brought up to the SQL
committee?

In general, we really prefer to avoid extending SQL beyond the standard,
especially in ways that the standard is likely to be expanded.  In
other words, we'd really prefer to avoid the risk that the SQL committee
declares WITH ITERATIVE to mean something else in the future, causing us
to have to have a breaking change to become compliant.  Now, if all the
other major DB vendors have WITH ITERATIVE and they all work the same
way, then we can have at least some confidence that the SQL committee
will end up defining it in that way and there won't be any issue.

We do have someone on the committee who watches these lists, hopefully
they'll have a chance to comment on this.  Perhaps it's already in
discussion in the committee.

Thanks!

Stephen




Re: Poll: are people okay with function/operator table redesign?

2020-04-28 Thread Tom Lane
"Jonathan S. Katz"  writes:
> On 4/27/20 8:49 AM, Tom Lane wrote:
>> Other than that point, the main.css patch as I presented it just adds
>> some rules that aren't used yet, so it could be pushed as soon as you're
>> satisfied about the !important change.  It'd probably make sense to
>> push it in advance of making the markup changes, so we don't have an
>> interval of near-unreadable devel docs.

> *nods* I'll ensure to test again and hopefully commit later today.

After looking at the JSON function tables, I've concluded that the
ability to have more than one function signature per table cell is
really rather essential, not optional.  So I'm going to go ahead and
convert all the existing markup to the -based style I proposed
on Sunday.  Please push the main.css change when you can.

regards, tom lane




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-28 Thread Pavel Stehule
út 28. 4. 2020 v 16:48 odesílatel Tomas Vondra 
napsal:

> On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote:
> >út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra <
> tomas.von...@2ndquadrant.com>
> >napsal:
> >
> >> ...
> >>
> >> >I'm not so concerned about this in any query where we have a real FROM
> >> >clause because even if we end up with only one row, the relative
> >> >penalty is low, and the potential gain is very high. But simple
> >> >expressions in pl/pgsql, for example, are a case where we can know for
> >> >certain (correct me if I've wrong on this) that we'll only execute the
> >> >expression once, which means there's probably always a penalty for
> >> >choosing the implementation with setup costs over the default linear
> >> >scan through the array.
> >> >
> >>
> >> What do you mean by "simple expressions"? I'm not plpgsql expert and I
> >> see it mostly as a way to glue together SQL queries, but yeah - if we
> >> know a given ScalarArrayOpExpr will only be executed once, then we can
> >> disable this optimization for now.
> >>
> >
> >a := a + 1
> >
> >is translated to
> >
> >SELECT $1 + 1 and save result to var a
> >
> >The queries like this "SELECT $1 + 1" are simple expressions. They are
> >evaluated just on executor level - it skip SPI
> >
> >the simple expression has not FROM clause, and have to return just one
> row.
> >I am not sure if it is required, it has to return just one column.
> >
> >I am not sure if executor knows so expression is executed as simply
> >expressions. But probably it can be deduced from context
> >
>
> Not sure. The executor state is created by exec_eval_simple_expr, which
> calls ExecInitExprWithParams (and it's the only caller). And that in
> turn is the only place that leaves (state->parent == NULL). So maybe
> that's a way to identify simple (standalone) expressions? Otherwise we
> might add a new EEO_FLAG_* to identify these expressions explicitly.
>
> I wonder if it would be possible to identify cases when the expression
> is executed in a loop, e.g. like this:
>
>  FOR i IN 1..1000 LOOP
>  x := y IN (1, 2, ..., 999);
>  END LOOP;
>
> in which case we only build the ScalarArrayOpExpr once, so maybe we
> could keep the hash table for all executions. But maybe that's not
> possible or maybe it's pointless for other reasons. It sure looks a bit
> like trying to build a query engine from FOR loop.
>

Theoretically it is possible, just not now. But I don't think it is
necessary. I cannot remember this pattern in any plpgsql code, and I have
never seen any request for this feature.

I don't think this is a task for the plpgsql engine. Maybe for JIT someday.


> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


Re: PG compilation error with Visual Studio 2015/2017/2019

2020-04-28 Thread davinder singh
On Mon, Apr 27, 2020 at 4:50 PM Amit Kapila  wrote:

> Bemba_Zambia
> Bena_Tanzania
> Bulgarian_Bulgaria
> Swedish_Sweden.1252
> Swedish_Sweden
>

I have tested with different locales with codepages including above. There
are few which return different locale code but the error messages in both
the cases are the same. I have attached the test and log files.
But there was one case, where locale code and error messages both are
different.
Portuguese_Brazil.1252

log from [1]
2020-04-28 14:27:39.785 GMT [2284] DEBUG:  IsoLocaleName() executed;
locale: "pt"
2020-04-28 14:27:39.787 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:39.787 GMT [2284] STATEMENT:  Select 1/0;

log from [2]
2020-04-28 14:36:20.666 GMT [14608] DEBUG:  IsoLocaleName() executed;
locale: "pt_BR"
2020-04-28 14:36:20.673 GMT [14608] ERRO:  divisão por zero
2020-04-28 14:36:20.673 GMT [14608] COMANDO:  Select 1/0;

[1] full_locale_lc_message_test_create_locale_1.txt: log generated by using
the old patch (it uses _create_locale API to get locale info)
[2] full_locale_lc_message_test_getlocale_1.txt: log generated using the
patch v13

-- 
Regards,
Davinder
EnterpriseDB: http://www.enterprisedb.com
2020-04-28 14:27:36.353 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"af"
2020-04-28 14:27:36.362 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.362 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.411 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"sq"
2020-04-28 14:27:36.414 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.414 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.432 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"am"
2020-04-28 14:27:36.435 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.435 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.456 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_DZ"
2020-04-28 14:27:36.459 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.459 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.480 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_BH"
2020-04-28 14:27:36.483 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.483 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.505 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_EG"
2020-04-28 14:27:36.507 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.507 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.529 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_IQ"
2020-04-28 14:27:36.531 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.531 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.553 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_JO"
2020-04-28 14:27:36.555 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.555 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.581 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_KW"
2020-04-28 14:27:36.584 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.584 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.604 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_LB"
2020-04-28 14:27:36.606 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.606 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.626 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_LY"
2020-04-28 14:27:36.634 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.634 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.655 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_MA"
2020-04-28 14:27:36.665 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.665 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.687 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_OM"
2020-04-28 14:27:36.697 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.697 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.719 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_QA"
2020-04-28 14:27:36.722 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.722 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.741 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar"
2020-04-28 14:27:36.749 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.749 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.774 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_SY"
2020-04-28 14:27:36.776 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.776 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.795 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_TN"
2020-04-28 14:27:36.803 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.803 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.824 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_AE"
2020-04-28 14:27:36.832 GMT [2284] ERROR:  division by zero
2020-04-28 14:27:36.832 GMT [2284] STATEMENT:  Select 1/0;
2020-04-28 14:27:36.853 GMT [2284] DEBUG:  IsoLocaleName() executed; locale: 
"ar_YE"
2020-04-28 14:27:36.862 GMT [2284] ERROR:  

Re: Proposing WITH ITERATIVE

2020-04-28 Thread Jonah H. Harris
On Tue, Apr 28, 2020 at 6:19 AM Andreas Karlsson  wrote:

> Do you have any examples of queries where it would help? It is pretty
> hard to say how much value some new syntax adds without seeing how it
> improves an intended use case.
>

Hey, Andreas.

Thanks for the response. I'm currently working on a few examples per Jeff's
response along with some benchmark information including improvements in
response time and memory usage of the current implementation. In the
meantime, as this functionality has been added to a couple of other
databases and there's academic research on it, if you're interested, here's
a few papers with examples:

http://faculty.neu.edu.cn/cc/zhangyf/papers/2018-ICDCS2018-sqloop.pdf
http://db.in.tum.de/~passing/publications/dm_in_hyper.pdf

-- 
Jonah H. Harris


Re: Proposing WITH ITERATIVE

2020-04-28 Thread Jonah H. Harris
On Tue, Apr 28, 2020 at 6:16 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> will help the community to focus on the actual details of your proposal.
>

I'd like it if the community would focus on the details of the proposal as
well :)

-- 
Jonah H. Harris


Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-28 Thread Tomas Vondra

On Tue, Apr 28, 2020 at 03:43:43PM +0200, Pavel Stehule wrote:

út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra 
napsal:


...

>I'm not so concerned about this in any query where we have a real FROM
>clause because even if we end up with only one row, the relative
>penalty is low, and the potential gain is very high. But simple
>expressions in pl/pgsql, for example, are a case where we can know for
>certain (correct me if I've wrong on this) that we'll only execute the
>expression once, which means there's probably always a penalty for
>choosing the implementation with setup costs over the default linear
>scan through the array.
>

What do you mean by "simple expressions"? I'm not plpgsql expert and I
see it mostly as a way to glue together SQL queries, but yeah - if we
know a given ScalarArrayOpExpr will only be executed once, then we can
disable this optimization for now.



a := a + 1

is translated to

SELECT $1 + 1 and save result to var a

The queries like this "SELECT $1 + 1" are simple expressions. They are
evaluated just on executor level - it skip SPI

the simple expression has not FROM clause, and have to return just one row.
I am not sure if it is required, it has to return just one column.

I am not sure if executor knows so expression is executed as simply
expressions. But probably it can be deduced from context



Not sure. The executor state is created by exec_eval_simple_expr, which
calls ExecInitExprWithParams (and it's the only caller). And that in
turn is the only place that leaves (state->parent == NULL). So maybe
that's a way to identify simple (standalone) expressions? Otherwise we
might add a new EEO_FLAG_* to identify these expressions explicitly.

I wonder if it would be possible to identify cases when the expression
is executed in a loop, e.g. like this:

FOR i IN 1..1000 LOOP
x := y IN (1, 2, ..., 999);
END LOOP;

in which case we only build the ScalarArrayOpExpr once, so maybe we
could keep the hash table for all executions. But maybe that's not
possible or maybe it's pointless for other reasons. It sure looks a bit
like trying to build a query engine from FOR loop.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: More efficient RI checks - take 2

2020-04-28 Thread Tom Lane
Stephen Frost  writes:
> * Robert Haas (robertmh...@gmail.com) wrote:
>> As you say, perhaps there's room for both things, but also as you say,
>> it's not obvious how to decide intelligently between them.

> The single-row case seems pretty clear and also seems common enough that
> it'd be worth paying the cost to figure out if it's a single-row
> statement or not.

That seems hard to do in advance ... but it would be easy to code
a statement-level AFTER trigger along the lines of

if (transition table contains one row)
// fast special case here
else
// slow general case here.
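
For illustration only, a hand-rolled sketch of that shape in plpgsql
(made-up table and column names, NULLs and MATCH semantics ignored; no
claim that the real RI machinery would be built this way):

  create function fk_check_stmt() returns trigger
  language plpgsql as $$
  begin
      if (select count(*) from new_rows) = 1 then
          -- fast special case: single-row probe, much like today's
          -- per-row RI trigger
          perform 1 from pktable p join new_rows n on p.id = n.pk_id;
          if not found then
              raise exception 'insert on fktable violates foreign key';
          end if;
      else
          -- slow general case: one set-oriented anti-join over the
          -- transition table
          perform 1 from new_rows n
              left join pktable p on p.id = n.pk_id
              where p.id is null;
          if found then
              raise exception 'insert on fktable violates foreign key';
          end if;
      end if;
      return null;
  end;
  $$;

  create trigger fk_check_after_insert
      after insert on fktable
      referencing new table as new_rows
      for each statement execute function fk_check_stmt();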

I think the question really comes down to this: is the per-row overhead of
the transition-table mechanism comparable to that of the AFTER trigger
queue?  Or if not, can we make it so?

regards, tom lane




Re: Binary COPY IN size reduction

2020-04-28 Thread Stephen Frost
Greetings,

* Lőrinc Pap (lor...@gradle.com) wrote:
> Thanks for the quick response, Tom!

We prefer to not top-post on these lists, just fyi.

> What about implementing only the first part of my proposal, i.e. BINARY
> COPY without the redundant column count & size info?

For my part, at least, I like the idea- but I'd encourage thinking about
what we might do in a mixed-response situation too, as that's something
that's been discussed as at least desirable.  As long as we aren't
ending up painting ourselves into a corner somehow (which it doesn't
seem like we are, but I've not looked deeply at it) and we don't break
any existing clients, I'd generally be supportive of such an
improvement.  Constantly sending "this 4-byte int is 4 bytes long"
certainly does seem like a waste of bandwidth.

Thanks,

Stephen




Re: [pg_dump] 'create index' statement is failing due to search_path is empty

2020-04-28 Thread Tom Lane
tushar  writes:
> While testing something else, I found one scenario where pg_dump is failing

> CREATE FUNCTION do_analyze() RETURNS VOID VOLATILE LANGUAGE SQL
>      AS 'ANALYZE pg_am';
> CREATE FUNCTION wrap_do_analyze(c INT) RETURNS INT IMMUTABLE LANGUAGE SQL
>      AS 'SELECT $1 FROM do_analyze()';
> CREATE INDEX ON vaccluster(wrap_do_analyze(i));
> INSERT INTO vaccluster VALUES (1), (2);

You failed to schema-qualify the function reference.  That's not
a pg_dump bug.
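
That is, the function body needs to spell out the schema the function lives
in (assuming public here), e.g.:

  CREATE FUNCTION wrap_do_analyze(c INT) RETURNS INT IMMUTABLE LANGUAGE SQL
      AS 'SELECT $1 FROM public.do_analyze()';

(Attaching a suitable SET search_path to the function would be another way
to make it restore-safe.)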

While we're on the subject: this is an intentionally unsafe index.
The system doesn't try very hard to prevent you from lying about the
volatility status of a function ... but when, not if, it breaks
we're not going to regard the consequences as a Postgres bug.
Basically, there isn't anything about this example that I'm not
going to disclaim as "that's not supported".

regards, tom lane




Re: More efficient RI checks - take 2

2020-04-28 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Apr 23, 2020 at 10:35 AM Tom Lane  wrote:
> > I think we're failing to communicate here.  I agree that if the goal
> > is simply to re-implement what the RI triggers currently do --- that
> > is, retail one-row-at-a-time checks --- then we could probably dispense
> > with all the parser/planner/executor overhead and directly implement
> > an indexscan using an API at about the level genam.c provides.
> > (The issue of whether it's okay to require an index to be available is
> > annoying, but we could always fall back to the old ways if one is not.)
> >
> > However, what I thought this thread was about was switching to
> > statement-level RI checking.  At that point, what we're talking
> > about is performing a join involving a not-known-in-advance number
> > of tuples on each side.  If you think you can hard-wire the choice
> > of join technology and have it work well all the time, I'm going to
> > say with complete confidence that you are wrong.  The planner spends
> > huge amounts of effort on that and still doesn't always get it right
> > ... but it does better than a hard-wired choice would do.
> 
> Oh, yeah. If we're talking about that, then getting by without using
> the planner doesn't seem feasible. Sorry, I guess I didn't read the
> thread carefully enough.

Yeah, I had been thinking about what we might do with the existing
row-level RI checks too.  If we're able to get statement-level without
much impact on the single-row-statement case then that's certainly
interesting, although it sure feels like we're ending up with a lot left
on the table.

> As you say, perhaps there's room for both things, but also as you say,
> it's not obvious how to decide intelligently between them.

The single-row case seems pretty clear and also seems common enough that
it'd be worth paying the cost to figure out if it's a single-row
statement or not.

Perhaps we start with row-level for the first row, implemented directly
using an index lookup, and when we hit some threshold (maybe even just
"more than one") switch to using the transient table and queue'ing
the rest to check at the end.

What bothers me the most about this approach (though, to be clear, I
think we should still pursue it) is the risk that we might end up
picking a spectacularly bad plan that ends up taking a great deal more
time than the index-probe based approach we almost always have today.
If we limit that impact to only cases where >1 row is involved, then
that's certainly better (though maybe we'll need a GUC for this
anyway..?  If we had the single-row approach + the statement-level one,
presumably the GUC would just make us always take the single-row method,
so it hopefully wouldn't be too grotty to have).

Thanks,

Stephen




Re: Fix compilation failure against LLVM 11

2020-04-28 Thread Jesse Zhang
Hi Michael,

On Mon, Apr 27, 2020 at 9:56 PM Michael Paquier wrote:
> rather soon, it looks like LLVM releases a new major version every 3
> months or so.

FYI LLVM has a six-month release cadence [1], the next release is
expected coming September (I can't tell whether you were joking).

Cheers,
Jesse

[1] https://llvm.org/docs/HowToReleaseLLVM.html#release-timeline




Re: Binary COPY IN size reduction

2020-04-28 Thread Lőrinc Pap
Thanks for the quick response, Tom!
What about implementing only the first part of my proposal, i.e. BINARY
COPY without the redundant column count & size info?
That would already be a big win - I agree the rest of the proposed changes
would only complicate the usage, but I'd argue that leaving out duplicated
info would even simplify it!

I'll give a better example this time - writing *1.8* million rows with
column types bigint, integer, smallint results in the following COPY IN
payloads:

*20.8MB* - Text protocol
*51.3MB* - Binary protocol
*25.6MB* - Binary, without column size info (proposal)


I.e. this would make the binary protocol almost as small as the text one
(which isn't an unreasonable expectation, I think), while making it easier
to use at the same time.
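
Back-of-the-envelope, assuming the documented per-tuple layout (a 16-bit
field count plus a 32-bit length word per field): each row carries
2 + 3*4 = 14 bytes of framing around 8 + 4 + 2 = 14 bytes of actual data,
so 1.8M rows come to roughly 1.8M * 28 ≈ 50MB with the framing and
1.8M * 14 ≈ 25MB without it, which lines up with the numbers above.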

Thanks for your time,
Lőrinc

On Fri, Apr 24, 2020 at 4:19 PM Tom Lane  wrote:

> Lőrinc Pap  writes:
> > We've switched recently from TEXT based COPY to the BINARY one.
> > We've noticed a slight performance increase, mostly because we don't need
> > to escape the content anymore.
> > Unfortunately the binary protocol's output ended up being slightly bigger
> > than the text one (e.g. for one payload it's *373MB* now, was *356MB*
> before)
> > ...
> > By skipping the column count and sizes for every row, in our example this
> > change would reduce the payload to *332MB* (most of our payload is
> binary,
> > lightweight structures consisting of numbers only could see a >*2x*
> > decrease in size).
>
> TBH, that amount of gain does not seem to be worth the enormous
> compatibility costs of introducing a new COPY data format.  What you
> propose also makes the format a great deal less robust (readers are
> less able to detect errors), which has other costs.  I'd vote no.
>
> regards, tom lane
>


-- 
Lőrinc Pap
Senior Software Engineer



Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-28 Thread Pavel Stehule
út 28. 4. 2020 v 15:26 odesílatel Tomas Vondra 
napsal:

> On Tue, Apr 28, 2020 at 08:39:18AM -0400, James Coleman wrote:
> >On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra
> > wrote:
> >>
> >> On Mon, Apr 27, 2020 at 09:04:09PM -0400, James Coleman wrote:
> >> >On Sun, Apr 26, 2020 at 7:41 PM James Coleman 
> wrote:
> >> >>
> >> >> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
> >> >>  wrote:
> >> >> >
> >> >> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
> >> >> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
> >> >> > > wrote:
> >> >> > >>
> >> >> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> >> >> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley <
> dgrowle...@gmail.com> wrote:
> >> >> > >> >>
> >> >> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
> >> >> > >> >> > This reminds me our attempts to add bloom filters to hash
> joins, which I
> >> >> > >> >> > think ran into mostly the same challenge of deciding when
> the bloom
> >> >> > >> >> > filter can be useful and is worth the extra work.
> >> >> > >> >>
> >> >> > >> >> Speaking of that, it would be interesting to see how a test
> where you
> >> >> > >> >> write the query as IN(VALUES(...)) instead of IN() compares.
> It would
> >> >> > >> >> be interesting to know if the planner is able to make a more
> suitable
> >> >> > >> >> choice and also to see how all the work over the years to
> improve Hash
> >> >> > >> >> Joins compares to the bsearch with and without the bloom
> filter.
> >> >> > >> >
> >> >> > >> >It would be interesting.
> >> >> > >> >
> >> >> > >> >It also makes one wonder about optimizing these into to hash
> >> >> > >> >joins...which I'd thought about over at [1]. I think it'd be a
> very
> >> >> > >> >significant effort though.
> >> >> > >> >
> >> >> > >>
> >> >> > >> I modified the script to also do the join version of the query.
> I can
> >> >> > >> only run it on my laptop at the moment, so the results may be a
> bit
> >> >> > >> different from those I shared before, but it's interesting I
> think.
> >> >> > >>
> >> >> > >> In most cases it's comparable to the binsearch/bloom approach,
> and in
> >> >> > >> some cases it actually beats them quite significantly. It seems
> to
> >> >> > >> depend on how expensive the comparison is - for "int" the
> comparison is
> >> >> > >> very cheap and there's almost no difference. For "text" the
> comparison
> >> >> > >> is much more expensive, and there are significant speedups.
> >> >> > >>
> >> >> > >> For example the test with 100k lookups in array of 10k elements
> and 10%
> >> >> > >> match probability, the timings are these
> >> >> > >>
> >> >> > >>master:  62362 ms
> >> >> > >>binsearch: 201 ms
> >> >> > >>bloom:  65 ms
> >> >> > >>hashjoin:   36 ms
> >> >> > >>
> >> >> > >> I do think the explanation is fairly simple - the bloom filter
> >> >> > >> eliminates about 90% of the expensive comparisons, so it's 20ms
> plus
> >> >> > >> some overhead to build and check the bits. The hash join
> probably
> >> >> > >> eliminates a lot of the remaining comparisons, because the hash
> table
> >> >> > >> is sized to have one tuple per bucket.
> >> >> > >>
> >> >> > >> Note: I also don't claim the PoC has the most efficient bloom
> filter
> >> >> > >> implementation possible. I'm sure it could be made faster.
> >> >> > >>
> >> >> > >> Anyway, I'm not sure transforming this to a hash join is worth
> the
> >> >> > >> effort - I agree that seems quite complex. But perhaps this
> suggest we
> >> >> > >> should not be doing binary search and instead just build a
> simple hash
> >> >> > >> table - that seems much simpler, and it'll probably give us
> about the
> >> >> > >> same benefits.
> >> >> > >
> >> >> > >That's actually what I originally thought about doing, but I chose
> >> >> > >binary search since it seemed a lot easier to get off the ground.
> >> >> > >
> >> >> >
> >> >> > OK, that makes perfect sense.
> >> >> >
> >> >> > >If we instead build a hash is there anything else we need to be
> >> >> > >concerned about? For example, work mem? I suppose for the binary
> >> >> > >search we already have to expand the array, so perhaps it's not
> all
> >> >> > >that meaningful relative to that...
> >> >> > >
> >> >> >
> >> >> > I don't think we need to be particularly concerned about work_mem.
> We
> >> >> > don't care about it now, and it's not clear to me what we could do
> about
> >> >> > it - we already have the array in memory anyway, so it's a bit
> futile.
> >> >> > Furthermore, if we need to care about it, it probably applies to
> the
> >> >> > binary search too.
> >> >> >
> >> >> > >I was looking earlier at what our standard hash implementation
> was,
> >> >> > >and it seemed less obvious what was needed to set that up (so
> binary
> >> >> > >search seemed a faster proof of concept). If you happen to have
> any
> >> >> > >pointers to similar usages I should look at, please let me know.
> 

Re: Parallel Append can break run-time partition pruning

2020-04-28 Thread Robert Haas
On Sat, Apr 25, 2020 at 7:25 AM David Rowley  wrote:
> I looked at this again and I don't think what I've got is right.

Apart from the considerations which you raised, I think there's also a
costing issue here. For instance, suppose we have an Append with three
children. Two of them have partial paths which will complete after
consuming 1 second of CPU time, and using a partial path is
unquestionably the best strategy for those children. The third one has
a partial path which will complete after using 10 seconds of CPU time
and a non-partial path which will complete after using 8 seconds of
CPU time. What strategy is best? If we pick the non-partial path, the
time that the whole thing takes to complete will be limited by the
non-partial path, which, since it cannot use parallelism, will take 8
seconds of wall clock time. If we pick the partial path, we have a
total of 12 seconds of CPU time which we can finish in 6 wall clock
seconds with 2 participants, 4 seconds with 3 participants, or 3
seconds with 4 participants. This is clearly a lot better.

Incidentally, the planner knows about the fact that you can't finish
an Append faster than you can finish its non-partial participants. See
append_nonpartial_cost().

Now, I don't think anything necessarily gets broken here by your patch
as written. Planner decisions are made on estimated cost, which is a
proxy for wall clock time, not CPU time. Right now, we only prefer the
non-partial path if we think it can be executed in less time by ONE
worker than the partial path could be executed by ALL workers:

/*
 * Either we've got only a non-partial path, or we think that
 * a single backend can execute the best non-partial path
 * faster than all the parallel backends working together can
 * execute the best partial path.

So, in the above example, we wouldn't even consider the non-partial
path. Even say we just have 2 workers. The partial path will have a
cost of 6, which is less than 8, so the non-partial path gets rejected.
But as the comment goes on to note:

 * It might make sense to be more aggressive here.  Even if
 * the best non-partial path is more expensive than the best
 * partial path, it could still be better to choose the
 * non-partial path if there are several such paths that can
 * be given to different workers.  For now, we don't try to
 * figure that out.

This kind of strategy is particularly appealing when the number of
Append children is large compared to the number of workers. For
instance, if you have an Append with 100 children and you are planning
with 4 workers, it's probably a pretty good idea to be very aggressive
about picking the path that uses the least resources, which the
current algorithm would not do. You're unlikely to end up with idle
workers, because you have so many plans to which they can be
allocated. However, it's possible: it could be that there's one child
which is way bigger than all the others and the really important thing
is to get a partial path for that child, so that it can be attacked in
parallel, even if that means that overall resource consumption is
higher. As the number of children decreases relative to the number of
workers, having partial paths becomes increasingly appealing. To take
a degenerate case, suppose you have 4 workers but only 2 children.
Partial paths should look really appealing, because the alternative is
leaving workers unused.

I *think* your patch is based around the idea of merging the
turn-the-append-of-partial-paths-into-a-parallel-append case with the
consider-a-mix-of-partial-and-nonpartial-paths-for-parallel-append
case. That seems possible to do given that the current heuristic is to
compare the raw path costs, but I think that it wouldn't work if we
wanted to be more aggressive about considering the mixed strategy and
letting the costing machinery sort out which way is better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-28 Thread Tomas Vondra

On Tue, Apr 28, 2020 at 08:39:18AM -0400, James Coleman wrote:

On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra
 wrote:


On Mon, Apr 27, 2020 at 09:04:09PM -0400, James Coleman wrote:
>On Sun, Apr 26, 2020 at 7:41 PM James Coleman  wrote:
>>
>> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
>>  wrote:
>> >
>> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
>> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
>> > > wrote:
>> > >>
>> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
>> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley  
wrote:
>> > >> >>
>> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra 
 wrote:
>> > >> >> > This reminds me our attempts to add bloom filters to hash joins, 
which I
>> > >> >> > think ran into mostly the same challenge of deciding when the bloom
>> > >> >> > filter can be useful and is worth the extra work.
>> > >> >>
>> > >> >> Speaking of that, it would be interesting to see how a test where you
>> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
>> > >> >> be interesting to know if the planner is able to make a more suitable
>> > >> >> choice and also to see how all the work over the years to improve 
Hash
>> > >> >> Joins compares to the bsearch with and without the bloom filter.
>> > >> >
>> > >> >It would be interesting.
>> > >> >
>> > >> >It also makes one wonder about optimizing these into to hash
>> > >> >joins...which I'd thought about over at [1]. I think it'd be a very
>> > >> >significant effort though.
>> > >> >
>> > >>
>> > >> I modified the script to also do the join version of the query. I can
>> > >> only run it on my laptop at the moment, so the results may be a bit
>> > >> different from those I shared before, but it's interesting I think.
>> > >>
>> > >> In most cases it's comparable to the binsearch/bloom approach, and in
>> > >> some cases it actually beats them quite significantly. It seems to
>> > >> depend on how expensive the comparison is - for "int" the comparison is
>> > >> very cheap and there's almost no difference. For "text" the comparison
>> > >> is much more expensive, and there are significant speedups.
>> > >>
>> > >> For example the test with 100k lookups in array of 10k elements and 10%
>> > >> match probability, the timings are these
>> > >>
>> > >>master:  62362 ms
>> > >>binsearch: 201 ms
>> > >>bloom:  65 ms
>> > >>hashjoin:   36 ms
>> > >>
>> > >> I do think the explanation is fairly simple - the bloom filter
>> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
>> > >> some overhead to build and check the bits. The hash join probably
>> > >> eliminates a lot of the remaining comparisons, because the hash table
>> > >> is sized to have one tuple per bucket.
>> > >>
>> > >> Note: I also don't claim the PoC has the most efficient bloom filter
>> > >> implementation possible. I'm sure it could be made faster.
>> > >>
>> > >> Anyway, I'm not sure transforming this to a hash join is worth the
>> > >> effort - I agree that seems quite complex. But perhaps this suggest we
>> > >> should not be doing binary search and instead just build a simple hash
>> > >> table - that seems much simpler, and it'll probably give us about the
>> > >> same benefits.
>> > >
>> > >That's actually what I originally thought about doing, but I chose
>> > >binary search since it seemed a lot easier to get off the ground.
>> > >
>> >
>> > OK, that makes perfect sense.
>> >
>> > >If we instead build a hash is there anything else we need to be
>> > >concerned about? For example, work mem? I suppose for the binary
>> > >search we already have to expand the array, so perhaps it's not all
>> > >that meaningful relative to that...
>> > >
>> >
>> > I don't think we need to be particularly concerned about work_mem. We
>> > don't care about it now, and it's not clear to me what we could do about
>> > it - we already have the array in memory anyway, so it's a bit futile.
>> > Furthermore, if we need to care about it, it probably applies to the
>> > binary search too.
>> >
>> > >I was looking earlier at what our standard hash implementation was,
>> > >and it seemed less obvious what was needed to set that up (so binary
>> > >search seemed a faster proof of concept). If you happen to have any
>> > >pointers to similar usages I should look at, please let me know.
>> > >
>> >
>> > I think the hash join implementation is far too complicated. It has to
>> > care about work_mem, so it implements batching, etc. That's a lot of
>> > complexity we don't need here. IMO we could use either the usual
>> > dynahash, or maybe even the simpler simplehash.
>> >
>> > FWIW it'd be good to verify the numbers I shared, i.e. checking that the
>> > benchmarks makes sense and running it independently. I'm not aware of
>> > any issues but it was done late at night and only ran on my laptop.
>>
>> Some quick calculations (don't have the scripting in a form I can
>> attach yet; using this as 
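
For reference, the IN (VALUES ...) rewrite mentioned upthread looks like this
(table name and literals are made up for illustration); the second form lets
the planner consider a hash semi-join instead of the linear Const-array scan:

SELECT * FROM t WHERE x IN (1, 2, 3, 4);

SELECT * FROM t WHERE x IN (VALUES (1), (2), (3), (4));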

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-28 Thread James Coleman
On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra
 wrote:
>
> On Mon, Apr 27, 2020 at 09:04:09PM -0400, James Coleman wrote:
> >On Sun, Apr 26, 2020 at 7:41 PM James Coleman  wrote:
> >>
> >> On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
> >>  wrote:
> >> >
> >> > On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
> >> > >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
> >> > > wrote:
> >> > >>
> >> > >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> >> > >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley  
> >> > >> >wrote:
> >> > >> >>
> >> > >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra 
> >> > >> >>  wrote:
> >> > >> >> > This reminds me our attempts to add bloom filters to hash joins, 
> >> > >> >> > which I
> >> > >> >> > think ran into mostly the same challenge of deciding when the 
> >> > >> >> > bloom
> >> > >> >> > filter can be useful and is worth the extra work.
> >> > >> >>
> >> > >> >> Speaking of that, it would be interesting to see how a test where 
> >> > >> >> you
> >> > >> >> write the query as IN(VALUES(...)) instead of IN() compares. It 
> >> > >> >> would
> >> > >> >> be interesting to know if the planner is able to make a more 
> >> > >> >> suitable
> >> > >> >> choice and also to see how all the work over the years to improve 
> >> > >> >> Hash
> >> > >> >> Joins compares to the bsearch with and without the bloom filter.
> >> > >> >
> >> > >> >It would be interesting.
> >> > >> >
> >> > >> >It also makes one wonder about optimizing these into to hash
> >> > >> >joins...which I'd thought about over at [1]. I think it'd be a very
> >> > >> >significant effort though.
> >> > >> >
> >> > >>
> >> > >> I modified the script to also do the join version of the query. I can
> >> > >> only run it on my laptop at the moment, so the results may be a bit
> >> > >> different from those I shared before, but it's interesting I think.
> >> > >>
> >> > >> In most cases it's comparable to the binsearch/bloom approach, and in
> >> > >> some cases it actually beats them quite significantly. It seems to
> >> > >> depend on how expensive the comparison is - for "int" the comparison 
> >> > >> is
> >> > >> very cheap and there's almost no difference. For "text" the comparison
> >> > >> is much more expensive, and there are significant speedups.
> >> > >>
> >> > >> For example the test with 100k lookups in array of 10k elements and 
> >> > >> 10%
> >> > >> match probability, the timings are these
> >> > >>
> >> > >>master:  62362 ms
> >> > >>binsearch: 201 ms
> >> > >>bloom:  65 ms
> >> > >>hashjoin:   36 ms
> >> > >>
> >> > >> I do think the explanation is fairly simple - the bloom filter
> >> > >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> >> > >> some overhead to build and check the bits. The hash join probably
> >> > >> eliminates a lot of the remaining comparisons, because the hash table
> >> > >> is sized to have one tuple per bucket.
> >> > >>
> >> > >> Note: I also don't claim the PoC has the most efficient bloom filter
> >> > >> implementation possible. I'm sure it could be made faster.
> >> > >>
> >> > >> Anyway, I'm not sure transforming this to a hash join is worth the
> >> > >> effort - I agree that seems quite complex. But perhaps this suggest we
> >> > >> should not be doing binary search and instead just build a simple hash
> >> > >> table - that seems much simpler, and it'll probably give us about the
> >> > >> same benefits.
> >> > >
> >> > >That's actually what I originally thought about doing, but I chose
> >> > >binary search since it seemed a lot easier to get off the ground.
> >> > >
> >> >
> >> > OK, that makes perfect sense.
> >> >
> >> > >If we instead build a hash is there anything else we need to be
> >> > >concerned about? For example, work mem? I suppose for the binary
> >> > >search we already have to expand the array, so perhaps it's not all
> >> > >that meaningful relative to that...
> >> > >
> >> >
> >> > I don't think we need to be particularly concerned about work_mem. We
> >> > don't care about it now, and it's not clear to me what we could do about
> >> > it - we already have the array in memory anyway, so it's a bit futile.
> >> > Furthermore, if we need to care about it, it probably applies to the
> >> > binary search too.
> >> >
> >> > >I was looking earlier at what our standard hash implementation was,
> >> > >and it seemed less obvious what was needed to set that up (so binary
> >> > >search seemed a faster proof of concept). If you happen to have any
> >> > >pointers to similar usages I should look at, please let me know.
> >> > >
> >> >
> >> > I think the hash join implementation is far too complicated. It has to
> >> > care about work_mem, so it implements batching, etc. That's a lot of
> >> > complexity we don't need here. IMO we could use either the usual
> >> > dynahash, or maybe even the simpler simplehash.
> >> >
> >> > FWIW it'd be good to verify the numbers I shared, i.e. checking that 

Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-28 Thread Tomas Vondra

On Mon, Apr 27, 2020 at 09:04:09PM -0400, James Coleman wrote:

On Sun, Apr 26, 2020 at 7:41 PM James Coleman  wrote:


On Sun, Apr 26, 2020 at 4:49 PM Tomas Vondra
 wrote:
>
> On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
> >On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
> > wrote:
> >>
> >> On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
> >> >On Sat, Apr 25, 2020 at 5:41 PM David Rowley  wrote:
> >> >>
> >> >> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra 
 wrote:
> >> >> > This reminds me our attempts to add bloom filters to hash joins, 
which I
> >> >> > think ran into mostly the same challenge of deciding when the bloom
> >> >> > filter can be useful and is worth the extra work.
> >> >>
> >> >> Speaking of that, it would be interesting to see how a test where you
> >> >> write the query as IN(VALUES(...)) instead of IN() compares. It would
> >> >> be interesting to know if the planner is able to make a more suitable
> >> >> choice and also to see how all the work over the years to improve Hash
> >> >> Joins compares to the bsearch with and without the bloom filter.
> >> >
> >> >It would be interesting.
> >> >
> >> >It also makes one wonder about optimizing these into to hash
> >> >joins...which I'd thought about over at [1]. I think it'd be a very
> >> >significant effort though.
> >> >
> >>
> >> I modified the script to also do the join version of the query. I can
> >> only run it on my laptop at the moment, so the results may be a bit
> >> different from those I shared before, but it's interesting I think.
> >>
> >> In most cases it's comparable to the binsearch/bloom approach, and in
> >> some cases it actually beats them quite significantly. It seems to
> >> depend on how expensive the comparison is - for "int" the comparison is
> >> very cheap and there's almost no difference. For "text" the comparison
> >> is much more expensive, and there are significant speedups.
> >>
> >> For example the test with 100k lookups in array of 10k elements and 10%
> >> match probability, the timings are these
> >>
> >>master:  62362 ms
> >>binsearch: 201 ms
> >>bloom:  65 ms
> >>hashjoin:   36 ms
> >>
> >> I do think the explanation is fairly simple - the bloom filter
> >> eliminates about 90% of the expensive comparisons, so it's 20ms plus
> >> some overhead to build and check the bits. The hash join probably
> >> eliminates a lot of the remaining comparisons, because the hash table
> >> is sized to have one tuple per bucket.
> >>
> >> Note: I also don't claim the PoC has the most efficient bloom filter
> >> implementation possible. I'm sure it could be made faster.
> >>
> >> Anyway, I'm not sure transforming this to a hash join is worth the
> >> effort - I agree that seems quite complex. But perhaps this suggest we
> >> should not be doing binary search and instead just build a simple hash
> >> table - that seems much simpler, and it'll probably give us about the
> >> same benefits.
> >
> >That's actually what I originally thought about doing, but I chose
> >binary search since it seemed a lot easier to get off the ground.
> >
>
> OK, that makes perfect sense.
>
> >If we instead build a hash is there anything else we need to be
> >concerned about? For example, work mem? I suppose for the binary
> >search we already have to expand the array, so perhaps it's not all
> >that meaningful relative to that...
> >
>
> I don't think we need to be particularly concerned about work_mem. We
> don't care about it now, and it's not clear to me what we could do about
> it - we already have the array in memory anyway, so it's a bit futile.
> Furthermore, if we need to care about it, it probably applies to the
> binary search too.
>
> >I was looking earlier at what our standard hash implementation was,
> >and it seemed less obvious what was needed to set that up (so binary
> >search seemed a faster proof of concept). If you happen to have any
> >pointers to similar usages I should look at, please let me know.
> >
>
> I think the hash join implementation is far too complicated. It has to
> care about work_mem, so it implements batching, etc. That's a lot of
> complexity we don't need here. IMO we could use either the usual
> dynahash, or maybe even the simpler simplehash.
>
> FWIW it'd be good to verify the numbers I shared, i.e. checking that the
> benchmarks makes sense and running it independently. I'm not aware of
> any issues but it was done late at night and only ran on my laptop.

Some quick calculations (don't have the scripting in a form I can
attach yet; using this as an opportunity to hack on a genericized
performance testing framework of sorts) suggest your results are
correct. I was also testing on my laptop, but I showed 1.) roughly
equivalent results for IN (VALUES ...) and IN () for integers,
but when I switch to (short; average 3 characters long) text values I
show the hash join on VALUES is twice as fast as the binary search.

Given that, I'm planning to 
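
To illustrate the "just build a simple hash table" idea in isolation, here is
a toy, self-contained sketch with a minimal open-addressing set for int4 keys.
The real thing would use simplehash or dynahash plus the type's hash/equality
support functions, so treat this purely as an illustration of the lookup
pattern that would replace the per-row binary search:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct IntSet
{
    int32_t    *slots;
    uint8_t    *used;
    size_t      nslots;
} IntSet;

static size_t
intset_hash(const IntSet *s, int32_t key)
{
    return ((uint32_t) key * 2654435761u) & (s->nslots - 1);
}

/* Build the set once per query from the Const array elements. */
static IntSet *
intset_build(const int32_t *keys, size_t nkeys)
{
    IntSet *s = malloc(sizeof(IntSet));

    s->nslots = 4;
    while (s->nslots < nkeys * 2)       /* keep load factor <= 0.5 */
        s->nslots <<= 1;
    s->slots = malloc(s->nslots * sizeof(int32_t));
    s->used = calloc(s->nslots, 1);

    for (size_t i = 0; i < nkeys; i++)
    {
        size_t h = intset_hash(s, keys[i]);

        while (s->used[h] && s->slots[h] != keys[i])
            h = (h + 1) & (s->nslots - 1);      /* linear probing */
        s->used[h] = 1;
        s->slots[h] = keys[i];
    }
    return s;
}

/* Probe once per input row, replacing the per-row binary search. */
static int
intset_contains(const IntSet *s, int32_t key)
{
    size_t h = intset_hash(s, key);

    while (s->used[h])
    {
        if (s->slots[h] == key)
            return 1;
        h = (h + 1) & (s->nslots - 1);
    }
    return 0;
}

int
main(void)
{
    int32_t arr[] = {3, 17, 42, 99};
    IntSet *s = intset_build(arr, 4);

    printf("%d %d\n", intset_contains(s, 42), intset_contains(s, 5));  /* 1 0 */
    return 0;
}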

Re: More efficient RI checks - take 2

2020-04-28 Thread Robert Haas
On Thu, Apr 23, 2020 at 10:35 AM Tom Lane  wrote:
> I think we're failing to communicate here.  I agree that if the goal
> is simply to re-implement what the RI triggers currently do --- that
> is, retail one-row-at-a-time checks --- then we could probably dispense
> with all the parser/planner/executor overhead and directly implement
> an indexscan using an API at about the level genam.c provides.
> (The issue of whether it's okay to require an index to be available is
> annoying, but we could always fall back to the old ways if one is not.)
>
> However, what I thought this thread was about was switching to
> statement-level RI checking.  At that point, what we're talking
> about is performing a join involving a not-known-in-advance number
> of tuples on each side.  If you think you can hard-wire the choice
> of join technology and have it work well all the time, I'm going to
> say with complete confidence that you are wrong.  The planner spends
> huge amounts of effort on that and still doesn't always get it right
> ... but it does better than a hard-wired choice would do.

Oh, yeah. If we're talking about that, then getting by without using
the planner doesn't seem feasible. Sorry, I guess I didn't read the
thread carefully enough.

As you say, perhaps there's room for both things, but also as you say,
it's not obvious how to decide intelligently between them.
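
For concreteness, the statement-level variant amounts to running one query of
roughly this shape per statement (table and transition-table names here are
hypothetical), and it is exactly the kind of join whose best strategy depends
on the not-known-in-advance row counts on both sides:

SELECT 1
  FROM new_fk_rows n                       -- transition table of changed rows
  LEFT JOIN pk_table p ON p.id = n.ref_id
 WHERE n.ref_id IS NOT NULL
   AND p.id IS NULL                        -- referenced row is missing
 LIMIT 1;                                  -- any result means an RI violation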

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Raw device on PostgreSQL

2020-04-28 Thread Andreas Karlsson

On 4/28/20 10:43 AM, Benjamin Schaller wrote:
For a university project I'm currently doing some research on 
PostgreSQL. I was wondering if hypothetically it would be possible to 
implement a raw device system to PostgreSQL. I know that the 
disadvantages would probably be higher than the advantages compared to 
working with the file system. Just hypothetically: Would it be possible 
to change the source code of PostgreSQL so a raw device system could be 
implemented, or would that cause a chain reaction so that basically one 
would have to rewrite almost the entire code, because too many elements 
of PostgreSQL rely on the file system?


It would require quite a bit of work since 1) PostgreSQL stores its data 
in multiple files and 2) PostgreSQL currently supports only synchronous 
buffered IO.


To get the performance benefits from using raw devices I think you would 
want to add support for asynchronous IO to PostgreSQL rather than 
implementing your own layer to emulate the kernel's buffered IO.
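
Just to give a flavour of what "doing it yourself" means: even the simplest 
direct access to a block device already pushes alignment and caching concerns 
onto the application. A minimal Linux-only sketch (the device path is 
hypothetical, and this is nothing like what PostgreSQL actually does):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sdb";  /* hypothetical device */
    int         fd = open(dev, O_RDONLY | O_DIRECT);    /* bypass the page cache */
    void       *buf = NULL;

    if (fd < 0 || posix_memalign(&buf, 4096, 8192) != 0)
    {
        perror("setup");
        return 1;
    }
    /* buffer, offset and length must all be block-aligned with O_DIRECT */
    if (pread(fd, buf, 8192, 0) < 0)
        perror("pread");

    close(fd);
    free(buf);
    return 0;
}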


Andres Freund did a talk on aync IO in PostgreSQL earlier this year. It 
was not recorded but the slides are available.


https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/

Andreas




Re: Raw device on PostgreSQL

2020-04-28 Thread Stephen Frost
Greetings,

* Benjamin Schaller (benjamin.schal...@s2018.tu-chemnitz.de) wrote:
> For a university project I'm currently doing some research on PostgreSQL. I
> was wondering if hypothetically it would be possible to implement a raw
> device system to PostgreSQL. I know that the disadvantages would probably be
> higher than the advantages compared to working with the file system. Just
> hypothetically: Would it be possible to change the source code of PostgreSQL
> so a raw device system could be implemented, or would that cause a chain
> reaction so that basically one would have to rewrite almost the entire code,
> because too many elements of PostgreSQL rely on the file system?

yes, it'd be possible, no, you wouldn't have to rewrite all of PG.
Instead, if you want it to be performant at all, you'd have to write
lots of new code to do all the things the filesystem and kernel do for
us today.

Thanks,

Stephen


signature.asc
Description: PGP signature


Re: PostgreSQL CHARACTER VARYING vs CHARACTER VARYING (Length)

2020-04-28 Thread Ashutosh Bapat
On Tue, Apr 28, 2020 at 2:53 PM Rajin Raj  wrote:
>
> Is there any impact of using the character varying without providing the 
> length while creating tables?
> I have created two tables and inserted 1M records. But I don't see any 
> difference in pg_class. (size, relpage)
>
> create table test_1(name varchar);
> create table test_2(name varchar(50));

I don't think there's a difference in the way these two are stored
on-disk. But if you know that your strings will be at most 50
characters long, it is better to set that limit so that the server can
enforce it: an attempt to insert a longer string raises an error
(only an explicit cast to varchar(50) silently truncates).
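
For example (illustrative; behavior of a stock PostgreSQL installation):

insert into test_2 values (repeat('x', 51));
-- ERROR:  value too long for type character varying(50)

select repeat('x', 51)::varchar(50);   -- an explicit cast silently truncates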

-- 
Best Wishes,
Ashutosh Bapat




Re: proposal - plpgsql - all plpgsql auto variables should be constant

2020-04-28 Thread Pavel Stehule
On Tue, Apr 28, 2020 at 1:35 PM Ashutosh Bapat <
ashutosh.bapat@gmail.com> wrote:

> On Mon, Apr 27, 2020 at 7:56 PM Greg Stark  wrote:
> >
> > On Fri, 24 Apr 2020 at 10:08, Tom Lane  wrote:
> > >
> > > I'm skeptical.  If we'd marked them that way from day one, it would
> have
> > > been fine, but to change it now is a whole different discussion.  I
> think
> > > the odds that anybody will thank us are much smaller than the odds that
> > > there will be complaints.  In particular, I'd be just about certain
> that
> > > there are people out there who are changing FOUND and loop control
> > > variables manually, and they will not appreciate us breaking their
> code.
> >
> > I kind of doubt it would break anybody's code. But I also doubt it's
> > actually going to help anybody. It's not exactly an easy bug to write,
> > so meh, I can't really get worked up either way about this.
>
> We could retain the old behaviour by using a GUC which defaults to the old
> behaviour. More GUCs mean more confusion, but this one GUC under the plpgsql
> extension might actually help.
>

I am not sure another GUC can help in this case. It probably could not be the
default, and beginners have no idea that they should enable this or a similar GUC.

This week I enhanced plpgsql_check (https://github.com/okbob/plpgsql_check)
with a new check related to this feature.

I am afraid that the people who need these checks and this kind of help
probably don't know about the extension.
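
For reference, this is the sort of perfectly legal (today) code that marking
the automatic variables constant would start rejecting -- a made-up minimal
example, not taken from anybody's real application:

do $$
begin
  perform 1;
  found := false;            -- manual assignment to the automatic variable FOUND
  for i in 1..3 loop
    i := i + 2;              -- manual assignment to the loop control variable
  end loop;
end;
$$;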





>
> --
> Best Wishes,
> Ashutosh Bapat
>


[pg_dump] 'create index' statement is failing due to search_path is empty

2020-04-28 Thread tushar

Hi,

While testing something else, I found a scenario where pg_dump is failing.

Below is the standalone scenario -

--connect to psql terminal and create 2 database

postgres=# create database db1;
CREATE DATABASE
postgres=# create database db2;
CREATE DATABASE

--Connect to database db1 and run the SQL below (taken from the vacuum.sql file)


\c db1

create  temp table vaccluster (i INT PRIMARY KEY);
ALTER TABLE vaccluster CLUSTER ON vaccluster_pkey;
CLUSTER vaccluster;

CREATE FUNCTION do_analyze() RETURNS VOID VOLATILE LANGUAGE SQL
    AS 'ANALYZE pg_am';
CREATE FUNCTION wrap_do_analyze(c INT) RETURNS INT IMMUTABLE LANGUAGE SQL
    AS 'SELECT $1 FROM do_analyze()';
CREATE INDEX ON vaccluster(wrap_do_analyze(i));
INSERT INTO vaccluster VALUES (1), (2);

--Take the  dump of db1 database  ( ./pg_dump -Fp db1 > /tmp/dump.sql)

--Restore the dump file into db2 database

You are now connected to database "db2" as user "tushar".
db2=# \i /tmp/dump.sql
SET
SET
SET
SET
SET
 set_config


(1 row)

SET
SET
SET
SET
CREATE FUNCTION
ALTER FUNCTION
CREATE FUNCTION
ALTER FUNCTION
SET
SET
CREATE TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
psql:/tmp/dump.sql:71: ERROR:  function do_analyze() does not exist
LINE 1: SELECT $1 FROM do_analyze()
   ^
HINT:  No function matches the given name and argument types. You might 
need to add explicit type casts.

QUERY:  SELECT $1 FROM do_analyze()
CONTEXT:  SQL function "wrap_do_analyze" during inlining
db2=#

Workaround -

Add "RESET search_path;" before the 'CREATE INDEX' statement in the dump.sql file.
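
A more permanent fix is to make the function body immune to the restore-time
empty search_path, e.g. (assuming both functions live in schema public):

CREATE OR REPLACE FUNCTION wrap_do_analyze(c INT) RETURNS INT IMMUTABLE LANGUAGE SQL
    AS 'SELECT $1 FROM public.do_analyze()';

-- or, alternatively, pin the path on the function itself:
-- ALTER FUNCTION wrap_do_analyze(int) SET search_path = public;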

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company





Re: proposal - plpgsql - all plpgsql auto variables should be constant

2020-04-28 Thread Ashutosh Bapat
On Mon, Apr 27, 2020 at 7:56 PM Greg Stark  wrote:
>
> On Fri, 24 Apr 2020 at 10:08, Tom Lane  wrote:
> >
> > I'm skeptical.  If we'd marked them that way from day one, it would have
> > been fine, but to change it now is a whole different discussion.  I think
> > the odds that anybody will thank us are much smaller than the odds that
> > there will be complaints.  In particular, I'd be just about certain that
> > there are people out there who are changing FOUND and loop control
> > variables manually, and they will not appreciate us breaking their code.
>
> I kind of doubt it would break anybody's code. But I also doubt it's
> actually going to help anybody. It's not exactly an easy bug to write,
> so meh, I can't really get worked up either way about this.

We could retain the old behaviour by using a GUC which defaults to the old
behaviour. More GUCs mean more confusion, but this one GUC under the plpgsql
extension might actually help.

-- 
Best Wishes,
Ashutosh Bapat




Re: Transactions involving multiple postgres foreign servers, take 2

2020-04-28 Thread Muhammad Usama
On Wed, Apr 8, 2020 at 11:16 AM Masahiko Sawada <
masahiko.saw...@2ndquadrant.com> wrote:

> On Fri, 27 Mar 2020 at 22:06, Muhammad Usama  wrote:
> >
> > Hi Sawada San,
> >
> > I have been further reviewing and testing the transaction involving
> multiple server patches.
> > Overall the patches are working as expected bar a few important
> exceptions.
> > So as discussed over the call I have fixed the issues I found during the
> testing
> > and also rebased the patches with the current head of the master branch.
> > So can you please have a look at the attached updated patches.
>
> Thank you for reviewing and updating the patch!
>
> >
> > Below is the list of changes I have made on top of V18 patches.
> >
> > 1- In register_fdwxact(), As we are just storing the callback function
> pointers from
> > FdwRoutine in fdw_part structure, So I think we can avoid calling
> > GetFdwRoutineByServerId() in TopMemoryContext.
> > So I have moved the MemoryContextSwitch to TopMemoryContext after the
> > GetFdwRoutineByServerId() call.
>
> Agreed.
>
> >
> >
> > 2- If PrepareForeignTransaction functionality is not present in some FDW
> then
> > during the registration process we should only set the
> XACT_FLAGS_FDWNOPREPARE
> > transaction flag if the modified flag is also set for that server. As
> for the server that has
> > not done any data modification within the transaction we do not do
> two-phase commit anyway.
>
> Agreed.
>
> >
> > 3- I have moved the foreign_twophase_commit in sample file after
> > max_foreign_transaction_resolvers because the default value of
> max_foreign_transaction_resolvers
> > is 0 and enabling the foreign_twophase_commit produces an error with
> default
> > configuration parameter positioning in postgresql.conf
> > Also, foreign_twophase_commit configuration was missing the comments
> > about allowed values in the sample config file.
>
> Sounds good. Agreed.
>
> >
> > 4- Setting ForeignTwophaseCommitIsRequired in
> is_foreign_twophase_commit_required()
> > function does not seem to be the correct place. The reason being, even
> when
> > is_foreign_twophase_commit_required() returns true after setting
> ForeignTwophaseCommitIsRequired
> > to true, we could still end up not using the two-phase commit in the
> case when some server does
> > not support two-phase commit and foreign_twophase_commit is set to
> FOREIGN_TWOPHASE_COMMIT_PREFER
> > mode. So I have moved the ForeignTwophaseCommitIsRequired assignment to
> PreCommit_FdwXacts()
> > function after doing the prepare transaction.
>
> Agreed.
>
> >
> > 6- In prefer mode, we commit the transaction in single-phase if the
> server does not support
> > the two-phase commit. But instead of doing the single-phase commit right
> away,
> > IMHO the better way is to wait until all the two-phase transactions are
> successfully prepared
> > on servers that support the two-phase. Since an error during a "PREPARE"
> stage would
> > rollback the transaction and in that case, we would end up with
> committed transactions on
> > the server that lacks the support of the two-phase commit.
>
> When an error occurred before the local commit, a 2pc-unsupported
> server could be rolled back or committed depending on the error
> timing. On the other hand all 2pc-supported servers are always rolled
> back when an error occurred before the local commit. Therefore even if
> we change the order of COMMIT and PREPARE it is still possible that we
> will end up committing the part of 2pc-unsupported servers while
> rolling back others including 2pc-supported servers.
>
> I guess the motivation of your change is that since errors are likely
> to happen during executing PREPARE on foreign servers, we can minimize
> the possibility of rolling back 2pc-unsupported servers by deferring
> the commit of 2pc-unsupported server as much as possible. Is that
> right?
>

Yes, that is correct. The idea of doing the COMMIT on NON-2pc-supported
servers after all the PREPAREs are successful is to minimize the chances of
partial commits. And as you mentioned, there will still be chances of getting
a partial commit even with this approach, but the probability of that would
be less than what it is with the current sequence.



>
> > So I have modified the flow a little bit and instead of doing a
> one-phase commit right away
> > the servers that do not support a two-phase commit is added to another
> list and that list is
> > processed after once we have successfully prepared all the transactions
> on two-phase supported
> > foreign servers. Although this technique is also not bulletproof, still
> it is better than doing
> > the one-phase commits before doing the PREPAREs.
>
> Hmm the current logic seems complex. Maybe we can just reverse the
> order of COMMIT and PREPARE; do PREPARE on all 2pc-supported and
> modified servers first and then do COMMIT on others?
>

Agreed, seems reasonable.
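
Just to spell out the ordering we seem to agree on, a standalone sketch
(illustrative only; the struct and function names below are made up and are
not the patch's actual code):

#include <stdbool.h>
#include <stdio.h>

typedef struct FdwServer
{
    const char *name;
    bool        modified;
    bool        supports_2pc;
} FdwServer;

static void
prepare_foreign(const FdwServer *s)
{
    printf("PREPARE TRANSACTION on %s\n", s->name);
}

static void
commit_foreign(const FdwServer *s)
{
    printf("one-phase COMMIT on %s\n", s->name);
}

static void
precommit_foreign_xacts(FdwServer *servers, int n)
{
    /*
     * Phase 1: PREPARE on every modified 2PC-capable server.  An error here
     * still lets us roll back everything, including the non-2PC servers.
     */
    for (int i = 0; i < n; i++)
        if (servers[i].modified && servers[i].supports_2pc)
            prepare_foreign(&servers[i]);

    /*
     * Phase 2: only now issue the irreversible one-phase commits on servers
     * that cannot prepare.
     */
    for (int i = 0; i < n; i++)
        if (servers[i].modified && !servers[i].supports_2pc)
            commit_foreign(&servers[i]);
}

int
main(void)
{
    FdwServer servers[] = {
        {"pgsrv1", true, true},
        {"legacy_fdw", true, false},
        {"pgsrv2", true, true},
    };

    precommit_foreign_xacts(servers, 3);
    return 0;
}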

>
> >
> > Also, I think we can improve on this one by throwing an error even in
> PREFER
> > mode 

Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-04-28 Thread Dilip Kumar
On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila  wrote:
>
> On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar  wrote:
> >
> [latest patches]
>
> v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> - Any actions leading to transaction ID assignment are prohibited.
> That, among others,
> + Note that access to user catalog tables or regular system catalog tables
> + in the output plugins has to be done via the
> systable_* scan APIs only.
> + Access via the heap_* scan APIs will error out.
> + Additionally, any actions leading to transaction ID assignment
> are prohibited. That, among others,
> ..
> @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
>   bool valid;
>
>   /*
> + * We don't expect direct calls to heap_fetch with valid
> + * CheckXidAlive for regular tables. Track that below.
> + */
> + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> + !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation
> + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> +
>
> I think comments and code don't match.  In the comment, we are saying
> that via output plugins access to user catalog tables or regular
> system catalog tables won't be allowed via heap_* APIs but code
> doesn't seem to reflect it.  I feel only
> TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> original discussion about this point [1] (Refer "I think it'd also be
> good to add assertions to codepaths not going through systable_*
> asserting that ...").

Right. So I think we can just add an assert in these functions, i.e.
Assert(!TransactionIdIsValid(CheckXidAlive))?

>
> Isn't it better to block the scan to user catalog tables or regular
> system catalog tables for tableam scan APIs rather than at the heap
> level?  There might be some APIs like heap_getnext where such a check
> might still be required but I guess it is still better to block at
> tableam level.
>
> [1] - 
> https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de

Okay, let me analyze this part.  Because in some places we would have to keep
the check at the heap level (like heap_getnext) and in other places at the
tableam level, it seems a bit inconsistent.  Also, I think the number of
checks might increase because some of the heap functions, like
heap_hot_search_buffer, are called from multiple tableam calls, so we would
need to put the check in every such place.

Another point is that I feel some of the checks we have today might not be
required; for example, heap_finish_speculative is not fetching any tuple for
us, so why do we need to care about that function?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: Proposing WITH ITERATIVE

2020-04-28 Thread Andreas Karlsson
On 4/27/20 6:52 PM, Jonah H. Harris wrote:
> If there's any interest, I'll clean-up their patch and submit. Thoughts?


Hi,

Do you have any examples of queries where it would help? It is pretty 
hard to say how much value some new syntax adds without seeing how it 
improves an intended use case.


Andreas




Re: Proposing WITH ITERATIVE

2020-04-28 Thread Oleksandr Shulgin
On Tue, Apr 28, 2020 at 5:49 AM Jonah H. Harris 
wrote:

> On Mon, Apr 27, 2020 at 11:33 PM David Fetter  wrote:
>
>>
>> Have the authors agreed to make it available to the project under a
>> compatible license?
>
>
> If there’s interest, obviously. Otherwise I wouldn’t be asking.
>
> I said from the start why I wasn’t attaching a patch and that I was seeing
> feedback. Honestly, David, stop wasting my, and list time, asking
> pointless off-topic questions.
>

Jonah,

I see it the other way round—it could end up as a waste of everyone's time
discussing the details, if the authors don't agree to publish their code in
the first place.  Of course, it could also be written from scratch, in
which case I guess someone else from the community (who haven't seen that
private code) would have to take a stab at it, but I believe it helps to
know this in advance.

I also don't see how it "obviously" follows from your two claims: "I've
found this functionality" and "I'll clean-up their patch and submit", that
you have even asked (or, for that matter—found) the authors of that code.

Finally, I'd like to suggest you adopt a more constructive tone and become
familiar with the project's Code of Conduct[1], if you haven't yet.  I am
certain that constructive, respectful communication from your side will
help the community to focus on the actual details of your proposal.

Kind regards,
-- 
Alex

[1] https://www.postgresql.org/about/policies/coc/


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-04-28 Thread Amit Kapila
On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar  wrote:
>
[latest patches]

v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
- Any actions leading to transaction ID assignment are prohibited.
That, among others,
+ Note that access to user catalog tables or regular system catalog tables
+ in the output plugins has to be done via the
systable_* scan APIs only.
+ Access via the heap_* scan APIs will error out.
+ Additionally, any actions leading to transaction ID assignment
are prohibited. That, among others,
..
@@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
  bool valid;

  /*
+ * We don't expect direct calls to heap_fetch with valid
+ * CheckXidAlive for regular tables. Track that below.
+ */
+ if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
+ !(IsCatalogRelation(relation) || RelationIsUsedAsCatalogTable(relation
+ elog(ERROR, "unexpected heap_fetch call during logical decoding");
+

I think comments and code don't match.  In the comment, we are saying
that via output plugins access to user catalog tables or regular
system catalog tables won't be allowed via heap_* APIs but code
doesn't seem to reflect it.  I feel only
TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
original discussion about this point [1] (Refer "I think it'd also be
good to add assertions to codepaths not going through systable_*
asserting that ...").

Isn't it better to block the scan to user catalog tables or regular
system catalog tables for tableam scan APIs rather than at the heap
level?  There might be some APIs like heap_getnext where such a check
might still be required but I guess it is still better to block at
tableam level.

[1] - 
https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




PostgreSQL CHARACTER VARYING vs CHARACTER VARYING (Length)

2020-04-28 Thread Rajin Raj
Is there any impact of using the character varying without providing the
length while creating tables?
I have created two tables and inserted 1M records. But I don't see any
difference in pg_class. (size, relpage)

create table test_1(name varchar);
create table test_2(name varchar(50));

insert into test_1 ... 10M records
insert into test_2 ... 10M records

vacuum (full,analyze) db_size_test_1;
vacuum (full,analyze) db_size_test_2;

Which option is recommended?

*Regards,*
*Rajin *


Raw device on PostgreSQL

2020-04-28 Thread Benjamin Schaller

Hey,

For a university project I'm currently doing some research on 
PostgreSQL. I was wondering if hypothetically it would be possible to 
implement a raw device system to PostgreSQL. I know that the 
disadvantages would probably be higher than the advantages compared to 
working with the file system. Just hypothetically: Would it be possible 
to change the source code of PostgreSQL so a raw device system could be 
implemented, or would that cause a chain reaction so that basically one 
would have to rewrite almost the entire code, because too many elements 
of PostgreSQL rely on the file system?


Best regards





Re: Why are wait events not reported even though it reads/writes a timeline history file?

2020-04-28 Thread Masahiro Ikeda

On 2020-04-28 15:09, Michael Paquier wrote:

On Tue, Apr 28, 2020 at 02:49:00PM +0900, Fujii Masao wrote:
Isn't it safer to report the wait event during fgets() rather than putting
those calls around the whole loop, like other code does? For example,
writeTimeLineHistory() reports the wait event during read() rather than
whole loop.


Yeah, I/O wait events should be taken only during the duration of the
system calls.  Particularly here, you may finish with an elog() that
causes the wait event to be set longer than it should, leading to a
rather incorrect state if a snapshot of pg_stat_activity is taken.
--


Thanks for your comments.

I fixed it to report the wait event during fgets() only.
Please review the v2 patch I attached.

Regards,
--
Masahiro Ikeda

From 20b6eb5bb7af574e4395f653917db5a54d13d7d5 Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda 
Date: Tue, 28 Apr 2020 01:39:25 +
Subject: [PATCH v2] Fix to report wait events about timeline history file.

Even though a timeline history file is read or written,
some wait events are not reported. This patch fixes those issues.
---
 src/backend/access/transam/timeline.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index de57d69..3e1b90f 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -123,8 +123,15 @@ readTimeLineHistory(TimeLineID targetTLI)
 	 * Parse the file...
 	 */
 	prevend = InvalidXLogRecPtr;
-	while (fgets(fline, sizeof(fline), fd) != NULL)
+	for (;;)
 	{
+		char	   *result;
+		pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_READ);
+		result = fgets(fline, sizeof(fline), fd);
+		pgstat_report_wait_end();
+		if (result == NULL)
+			break;
+
 		/* skip leading whitespace and check for # comment */
 		char	   *ptr;
 		TimeLineID	tli;
@@ -393,6 +400,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 
 	nbytes = strlen(buffer);
 	errno = 0;
+	pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_WRITE);
 	if ((int) write(fd, buffer, nbytes) != nbytes)
 	{
 		int			save_errno = errno;
@@ -408,6 +416,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 (errcode_for_file_access(),
  errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
+	pgstat_report_wait_end();
 
 	pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_SYNC);
 	if (pg_fsync(fd) != 0)
-- 
2.9.5



Re: [HACKERS] Restricting maximum keep segments by repslots

2020-04-28 Thread Kyotaro Horiguchi
At Mon, 27 Apr 2020 19:40:07 -0400, Alvaro Herrera  
wrote in 
> On 2020-Apr-08, Kyotaro Horiguchi wrote:
> 
> > I understand how it happens.
> > 
> > The latch triggered by checkpoint request by CHECKPOINT command has
> > been absorbed by ConditionVariableSleep() in
> > InvalidateObsoleteReplicationSlots.  The attached allows checkpointer
> > use MyLatch for other than checkpoint request while a checkpoint is
> > running.
> 
> Hmm, that explanation makes sense, but I couldn't reproduce it with the
> steps you provided.  Perhaps I'm missing something.

Sorry for the incomplete reproducer. A checkpoint needs to be running
simultaneously for the manual checkpoint to hang up.  The following is
the complete sequence.

1. Build a primary database cluster with the following setup, then start it.
   max_slot_wal_keep_size=0
   max_wal_size=32MB
   min_wal_size=32MB

2. Build a replica from the primary creating a slot, then start it.

   $ pg_basebackup -R -C -S s1 -D...
   
3. Try the following commands. Try several times if it succeeds.
  =# create table tt(); drop table tt; select pg_switch_wal();checkpoint;

It is evidently stochastic, but it works quite reliably for me.

> Anyway I think this patch should fix it also -- instead of adding a new
> flag, we just rely on the existing flags (since do_checkpoint must have
> been set correctly from the flags earlier in that block.)

Since the added (!do_checkpoint) check is reached with
do_checkpoint=false at server start and at archive_timeout intervals,
the patch makes the checkpointer run a busy loop at those times, and that
loop lasts until a checkpoint is actually executed.

What we need to do here is not forget the fact that the latch has
been set, even if the latch itself gets reset before reaching
WaitLatch.

> I think it'd be worth to verify this bugfix in a new test.  Would you
> have time to produce that?  I could try in a couple of days ...

The attached patch to 019_replslot_limit.pl runs the commands above
automatically. It sometimes succeeds but fails in most cases, at least
for me.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/test/recovery/t/019_replslot_limit.pl b/src/test/recovery/t/019_replslot_limit.pl
index 32dce54522..c8ec4bb363 100644
--- a/src/test/recovery/t/019_replslot_limit.pl
+++ b/src/test/recovery/t/019_replslot_limit.pl
@@ -8,7 +8,7 @@ use TestLib;
 use PostgresNode;
 
 use File::Path qw(rmtree);
-use Test::More tests => 13;
+use Test::More tests => 14;
 use Time::HiRes qw(usleep);
 
 $ENV{PGDATABASE} = 'postgres';
@@ -181,6 +181,36 @@ ok($failed, 'check that replication has been broken');
 
 $node_standby->stop;
 
+my $node_master2 = get_new_node('master2');
+$node_master2->init(allows_streaming => 1);
+$node_master2->append_conf('postgresql.conf', qq(
+min_wal_size = 32MB
+max_wal_size = 32MB
+log_checkpoints = yes
+));
+$node_master2->start;
+$node_master2->safe_psql('postgres', "SELECT pg_create_physical_replication_slot('rep1')");
+$backup_name = 'my_backup2';
+$node_master2->backup($backup_name);
+
+$node_master2->stop;
+$node_master2->append_conf('postgresql.conf', qq(
+max_slot_wal_keep_size = 0
+));
+$node_master2->start;
+
+$node_standby = get_new_node('standby_2');
+$node_standby->init_from_backup($node_master2, $backup_name, has_streaming => 1);
+$node_standby->append_conf('postgresql.conf', "primary_slot_name = 'rep1'");
+$node_standby->start;
+my @result =
+  split('\n', $node_master2->safe_psql('postgres', "
+	   CREATE TABLE tt(); DROP TABLE tt;
+	   SELECT pg_switch_wal();
+	   CHECKPOINT;
+	   SELECT 'finished';", timeout=>'5'));
+is($result[1], 'finished', 'check if checkpoint command is not blocked');
+
 #
 # Advance WAL of $node by $n segments
 sub advance_wal


Re: Fixes for two separate bugs in nbtree VACUUM's page deletion

2020-04-28 Thread Masahiko Sawada
On Tue, 28 Apr 2020 at 11:35, Peter Geoghegan  wrote:
>
> On Mon, Apr 27, 2020 at 11:02 AM Peter Geoghegan  wrote:
> > I would like to backpatch both patches to all branches that have
> > commit 857f9c36cda -- v11, v12, and master. The second issue isn't
> > serious, but it seems worth keeping v11+ in sync in this area. Note
> > that any backpatch theoretically creates an ABI break for callers of
> > the _bt_pagedel() function.
>
> I plan to commit both fixes, while backpatching to v11, v12 and the
> master branch on Friday -- barring objections.

Thank you for the patches and for starting the new thread.

I agree with both patches.

For the first fix it seems better to push down the logic to the page
deletion code, as your 0001 patch does. The following change makes the
page deletion code emit a LOG message indicating index corruption when a
deleted page is passed, whereas we used to silently ignore that case.

@@ -1511,14 +1523,21 @@ _bt_pagedel(Relation rel, Buffer buf)

for (;;)
{
-   page = BufferGetPage(buf);
+   page = BufferGetPage(leafbuf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);

/*
 * Internal pages are never deleted directly, only as part of deleting
 * the whole branch all the way down to leaf level.
+*
+* Also check for deleted pages here.  Caller never passes us a fully
+* deleted page.  Only VACUUM can delete pages, so there can't have
+* been a concurrent deletion.  Assume that we reached any deleted
+* page encountered here by following a sibling link, and that the
+* index is corrupt.
 */
-   if (!P_ISLEAF(opaque))
+   Assert(!P_ISDELETED(opaque));
+   if (!P_ISLEAF(opaque) || P_ISDELETED(opaque))
{
/*
 * Pre-9.4 page deletion only marked internal pages as half-dead,
@@ -1537,13 +1556,21 @@ _bt_pagedel(Relation rel, Buffer buf)
 errmsg("index \"%s\" contains a half-dead
internal page",
RelationGetRelationName(rel)),
 errhint("This can be caused by an interrupted
VACUUM in version 9.3 or older, before upgrade. Please REINDEX
it.")));
-   _bt_relbuf(rel, buf);
+
+   if (P_ISDELETED(opaque))
+   ereport(LOG,
+   (errcode(ERRCODE_INDEX_CORRUPTED),
+errmsg("index \"%s\" contains a deleted page
with a live sibling link",
+   RelationGetRelationName(rel;
+
+   _bt_relbuf(rel, leafbuf);
return ndeleted;
}

I agree with this change, but since I'm not sure it directly
relates to this bug, I wonder if it could be a separate patch.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Handling of concurrent aborts in logical decoding of in-progress xacts

2020-04-28 Thread Amit Kapila
One of the problems to allow logical decoding of in-progress (or
prepared) transactions is that the transaction which we are decoding
can abort concurrently and we might not be able to detect it which
leads to getting the wrong version of a row from system catalogs.
This can further lead to unpredictable behavior when decoding WAL with
wrong catalog information.

The problem occurs when we have a sequence of operations like
Alter-Abort-Alter as discussed previously in the context of two-phase
transactions [1].  I am reiterating the same problem so as to ease the
discussion for in-progress transactions.

Suppose we created a table (mytable), then in some transaction (say
with txid-200) we are altering it and after that aborting that txn. So
pg_class will have something like this:

xmin | xmax | relname
100  | 200  | mytable
200  | 0    | mytable

After the previous abort, tuple (100,200,mytable) becomes visible to
other transactions and if we will alter the table again then xmax of
first tuple will be set current xid, resulting in following table:

xmin | xmax | relname
100  | 300  | mytable
200  | 0    | mytable
300  | 0    | mytable

At that moment we’ve lost information that first tuple was deleted by
our txn (txid-200).  And from POV of the historic snapshot, the first
tuple (100, 300, mytable) will become visible and used to decode the
WAL, but actually the second tuple should have been used. Moreover,
such a snapshot could see both tuples (first and second) violating oid
uniqueness.
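
In SQL terms, the sequence looks like this (the txids 100/200/300 above are
of course just example values):

CREATE TABLE mytable (a int);              -- say txid 100: creates the pg_class row

BEGIN;                                     -- say txid 200
ALTER TABLE mytable ADD COLUMN b int;      -- sets xmax on the old row, inserts a new one
ROLLBACK;                                  -- abort: the old row becomes visible again

ALTER TABLE mytable ADD COLUMN c int;      -- say txid 300: sets xmax = 300 on the first row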

Now, we are planning to handle such a situation by returning
ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table scan APIs to
the backend decoding a specific uncommitted transaction. The decoding
logic on the receipt of such an sqlerrcode aborts the ongoing
decoding, discard the changes of current (sub)xact and continue
processing the changes of other txns.  The basic idea is that while
setting up historic snapshots (SetupHistoricSnapshot), we remember the
xid corresponding to ReorderBufferTXN (provided it's not committed
yet) whose changes we are going to decode.  Now, after getting the
tuple from a system table scan, we check if the remembered XID is
aborted and if so we return a specific error
ERRCODE_TRANSACTION_ROLLBACK which helps the caller to proceed
gracefully by discarding the changes of such a transaction.  This is
based on the idea/patch [2][3] discussed among Andres Freund and
Nikhil Sontakke in the context of logical decoding of two-phase
transactions.

We need to deal with this problem to allow logical decoding of
in-progress transactions which is being discussed in a separate thread
[4].  I have previously tried to discuss this in the main thread [5]
but didn't get much response so trying again.

Thoughts?

[1] - 
https://www.postgresql.org/message-id/EEBD82AA-61EE-46F4-845E-05B94168E8F2%40postgrespro.ru
[2] - 
https://www.postgresql.org/message-id/20180720145836.gxwhbftuoyx5h4gc%40alap3.anarazel.de
[3] - 
https://www.postgresql.org/message-id/CAMGcDxcBmN6jNeQkgWddfhX8HbSjQpW%3DUo70iBY3P_EPdp%2BLTQ%40mail.gmail.com
[4] - 
https://www.postgresql.org/message-id/688b0b7f-2f6c-d827-c27b-216a8e3ea700%402ndquadrant.com
[5] - 
https://www.postgresql.org/message-id/CAA4eK1KtxLpSp2rP6Rt8izQnPmhiA%3D2QUpLk%2BvoagTjKowc0HA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Pulling up sublink may break join-removal logic

2020-04-28 Thread Richard Guo
Hi hackers,

I happened to notice $subject and I'm not sure if it's an issue or not. When
we're trying to remove a LEFT JOIN, one of the requirements is the inner
side needs to be a single baserel. If there is a join qual that is a
sublink and can be converted to a semi join with the inner side rel, the
inner side would no longer be a single baserel and as a result the LEFT
JOIN can no longer be removed.

Here is an example to illustrate this behavior:

create table a(i int, j int);
create table b(i int UNIQUE, j int);
create table c(i int, j int);

# explain (costs off) select a.i from a left join b on a.i = b.i and
b.j in (select j from c where b.i = c.i);
  QUERY PLAN
---
 Seq Scan on a
(1 row)

For the query above, we do not pull up the sublink and the LEFT JOIN is
removed.


# explain (costs off) select a.i from a left join b on a.i = b.i and
b.j in (select j from c);
  QUERY PLAN
---
 Hash Left Join
   Hash Cond: (a.i = b.i)
   ->  Seq Scan on a
   ->  Hash
 ->  Hash Semi Join
   Hash Cond: (b.j = c.j)
   ->  Seq Scan on b
   ->  Hash
 ->  Seq Scan on c
(9 rows)

Now for this second query, the sublink is pulled up to be a semi-join
with inner side rel 'b', which makes the inner side no longer a single
baserel. That prevents the LEFT JOIN from being removed.

That is to say, pulling up a sublink sometimes breaks the join-removal logic.
Is this an issue that bothers you too?

Thanks
Richard


Re: Why are wait events not reported even though it reads/writes a timeline history file?

2020-04-28 Thread Michael Paquier
On Tue, Apr 28, 2020 at 02:49:00PM +0900, Fujii Masao wrote:
> Isn't it safer to report the wait event during fgets() rather than putting
> those calls around the whole loop, like other code does? For example,
> writeTimeLineHistory() reports the wait event during read() rather than
> whole loop.

Yeah, I/O wait events should be taken only during the duration of the
system calls.  Particularly here, you may finish with an elog() that
causes the wait event to be set longer than it should, leading to a
rather incorrect state if a snapshot of pg_stat_activity is taken.
--
Michael


signature.asc
Description: PGP signature


Re: +(pg_lsn, int8) and -(pg_lsn, int8) operators

2020-04-28 Thread Michael Paquier
On Tue, Apr 28, 2020 at 12:56:19PM +0900, Fujii Masao wrote:
> Yes. Attached is the updated version of the patch, which introduces
> +(pg_lsn, numeric) and -(pg_lsn, numeric) operators.
> To implement them, I added also numeric_pg_lsn() function that converts
> numeric to pg_lsn.

-those write-ahead log locations.
+those write-ahead log locations. Also the number of bytes can be added
+into and subtracted from LSN using the + and
+- operators, respectively.
That's short.  Should this mention the restriction with numeric (or
just recommend its use) because we don't have a 64b unsigned type
internally, basically Robert's point? 
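
Perhaps a short example in the docs would help as well; with the patch
applied, the intended usage is something like:

SELECT '0/1000'::pg_lsn + 16::numeric;   -- 0/1010
SELECT '0/1010'::pg_lsn - 16::numeric;   -- 0/1000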

+   /* XXX would it be better to return NULL? */
+   if (NUMERIC_IS_NAN(num))
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("cannot convert NaN to pg_lsn")));
That would be good to test, and an error sounds fine to me.
--
Michael


signature.asc
Description: PGP signature