Some code cleanup for pgbench and pg_verifybackup
Hi all,

While looking at the code areas of $subject, I was surprised by a couple of things:

- pgbench has its own parsing routines for int64 and double, with an option
  to skip errors.  That's not surprising in itself, but, for strtodouble(),
  errorOK is always true, meaning that the error strings are dead code.  For
  strtoint64(), errorOK is false only when parsing a Variable, where a second
  error string is generated.  I don't really think that we need to be that
  pedantic about the type of errors generated in those code paths when
  failing to parse a variable, so I'd like to propose a simplification of the
  code where we reuse the same error message as for double, cutting down a
  bit on the number of translatable strings.

- pgbench and pg_verifybackup make use of pg_log_fatal() to report some
  failures in code paths dedicated to command-line options, which is
  inconsistent with all the other tools that use pg_log_error().

Please find attached a patch to clean up all those inconsistencies.

Thoughts?
--
Michael

diff --git a/src/bin/pg_verifybackup/pg_verifybackup.c b/src/bin/pg_verifybackup/pg_verifybackup.c
index f5ebd57a47..5f0469373f 100644
--- a/src/bin/pg_verifybackup/pg_verifybackup.c
+++ b/src/bin/pg_verifybackup/pg_verifybackup.c
@@ -261,7 +261,7 @@ main(int argc, char **argv)
 	/* Get backup directory name */
 	if (optind >= argc)
 	{
-		pg_log_fatal("no backup directory specified");
+		pg_log_error("no backup directory specified");
 		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 				progname);
 		exit(1);
@@ -272,7 +272,7 @@ main(int argc, char **argv)
 	/* Complain if any arguments remain */
 	if (optind < argc)
 	{
-		pg_log_fatal("too many command-line arguments (first is \"%s\")",
+		pg_log_error("too many command-line arguments (first is \"%s\")",
 					 argv[optind]);
 		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 				progname);
diff --git a/src/bin/pgbench/exprscan.l b/src/bin/pgbench/exprscan.l
index 75432cedc6..b3322f01f7 100644
--- a/src/bin/pgbench/exprscan.l
+++ b/src/bin/pgbench/exprscan.l
@@ -205,19 +205,19 @@ notnull			[Nn][Oo][Tt][Nn][Uu][Ll][Ll]
 					return MAXINT_PLUS_ONE_CONST;
 				}
{digit}+			{
-					if (!strtoint64(yytext, true, &yylval->ival))
+					if (!strtoint64(yytext, &yylval->ival))
 						expr_yyerror_more(yyscanner, "bigint constant overflow",
 										  strdup(yytext));
 					return INTEGER_CONST;
 				}
{digit}+(\.{digit}*)?([eE][-+]?{digit}+)?	{
-					if (!strtodouble(yytext, true, &yylval->dval))
+					if (!strtodouble(yytext, &yylval->dval))
 						expr_yyerror_more(yyscanner, "double constant overflow",
 										  strdup(yytext));
 					return DOUBLE_CONST;
 				}
\.{digit}+([eE][-+]?{digit}+)?	{
-					if (!strtodouble(yytext, true, &yylval->dval))
+					if (!strtodouble(yytext, &yylval->dval))
 						expr_yyerror_more(yyscanner, "double constant overflow",
 										  strdup(yytext));
 					return DOUBLE_CONST;
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index c51ebb8e31..87f3b9f3b6 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -803,7 +803,7 @@ is_an_int(const char *str)
  * If not errorOK, an error message is also printed out on errors.
  */
 bool
-strtoint64(const char *str, bool errorOK, int64 *result)
+strtoint64(const char *str, int64 *result)
 {
 	const char *ptr = str;
 	int64		tmp = 0;
@@ -862,19 +862,15 @@ strtoint64(const char *str, bool errorOK, int64 *result)
 	return true;
 
 out_of_range:
-	if (!errorOK)
-		pg_log_error("value \"%s\" is out of range for type bigint", str);
 	return false;
 
 invalid_syntax:
-	if (!errorOK)
-		pg_log_error("invalid input syntax for type bigint: \"%s\"", str);
 	return false;
 }
 
 /* convert string to double, detecting overflows/underflows */
 bool
-strtodouble(const char *str, bool errorOK, double *dv)
+strtodouble(const char *str, double *dv)
 {
 	char	   *end;
 
@@ -882,18 +878,9 @@ strtodouble(const char *str, bool errorOK, double *dv)
 	*dv = strtod(str, &end);
 
 	if (unlikely(errno != 0))
-	{
-		if (!errorOK)
-			pg_log_error("value \"%s\" is out of range for type double", str);
 		return false;
-	}
 
 	if (unlikely(end == str || *end != '\0'))
-	{
-		if (!errorOK)
-			pg_log_error("invalid input syntax for type double: \"%s\"", str);
 		return false;
-	}
 
 	return true;
 }
@@ -1525,16 +1512,19 @@ makeVariableValue(Variable *var)
 		/* if it looks like an int, it must be an int without overflow */
 		int64		iv;
 
-		if (!strtoint64(var->svalue, false, &iv))
+		if (!strtoint64(var->svalue, &iv))
+		{
+			pg_log_error("malformed variable \"%s\" value: \"%s\"",
+						 var->name, var->svalue);
 			return false;
-
+		}
 		setIntValue(&var->value, iv);
 	}
 	else						/* type should be double */
 	{
		double		dv;
 
-		if (!strtodouble(var->svalue, true, &dv))
+		if (!strtodouble(var->svalue, &dv))
 		{
 			pg_log_error("malformed variable \"%s\" value: \"%s\"",
 						 var->name, var->svalue);
@@ -5900,12 +5890,12 @@ main(int
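To illustrate the proposed division of labor, here is a standalone sketch:
strtodouble() only reports success or failure, and the caller owns any error
message.  (Simplified; pgbench's real code uses pg_log_error rather than
fprintf, and the int64 path is analogous.)

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of the simplified contract: no errorOK, no messages in here. */
static bool
strtodouble_sketch(const char *str, double *dv)
{
	char	   *end;

	errno = 0;
	*dv = strtod(str, &end);

	if (errno != 0)
		return false;			/* out of range */
	if (end == str || *end != '\0')
		return false;			/* invalid syntax */
	return true;
}

int
main(void)
{
	double		dv;

	/* the caller decides how to complain, mirroring makeVariableValue() */
	if (!strtodouble_sketch("1e999", &dv))
		fprintf(stderr, "malformed variable \"%s\" value: \"%s\"\n",
				"x", "1e999");
	return 0;
}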
Re: something is wonky with pgbench pipelining
On Sat, Jul 24, 2021 at 04:08:33PM -0700, Andres Freund wrote:
> On 2021-07-21 16:55:08 -0700, Andres Freund wrote:
>> I'm inclined to push it to 14 and master, but ...
>
> RMT: ^

If it were me, I think that I would have back-patched this change even if it
had been found after the GA release of 14, as there are no advantages in
keeping the current behavior other than overloading pgbench with unnecessary
system calls.  No objections from me to change that now for 14~.
--
Michael
Re: pg_upgrade don't echo windows commands
On Fri, Jun 04, 2021 at 12:10:47PM -0400, Andrew Dunstan wrote:
> Here's a completely trivial command to turn off echoing of a couple of
> Windows commands pg_upgrade writes to cleanup scripts. This makes them
> behave more like the Unix equivalents.

Why not.  Perhaps you should add a comment to mention that prefixing those
commands with @ disables echoing of the command itself.  That's not obvious
to the reader.
--
Michael
Re: [Doc] Tiny fix for regression tests example
On Mon, Jul 26, 2021 at 05:50:28AM +0000, tanghy.f...@fujitsu.com wrote:
> Thanks for your comment. Agree with your suggestion.
> Modified it in the attachment patch.

Okay, applied, but without the pwd part.
--
Michael
Re: Logical Replication - improve error message while adding tables to the publication in check_publication_add_relation
On Wed, Jul 7, 2021 at 5:35 PM Bharath Rupireddy wrote:
>
> Attaching v6 patch rebased onto the latest master.

I came across the recent commit 81d5995 and have used the same error message
for temporary and unlogged tables. I also added test cases to cover these
error cases for foreign, temporary, unlogged and system tables with the
CREATE PUBLICATION command. PSA v7.

commit 81d5995b4b78017ef9e5c6f151361d1fb949924c
Author: Peter Eisentraut
Date:   Wed Jul 21 07:40:05 2021 +0200

    More improvements of error messages about mismatching relkind

Regards,
Bharath Rupireddy.

v7-0001-Improve-publication-error-messages.patch
Re: CREATE SEQUENCE with RESTART option
On Sat, Jul 24, 2021 at 09:56:40PM +0530, Bharath Rupireddy wrote:
> LGTM. PSA v2 patch.

FWIW, like Ashutosh upthread, my vote would be to do nothing here in terms
of behavior changes, as this is just breaking a behavior for the sake of
breaking it, so there is a chance that this is going to piss off some users
who relied accidentally on the existing behavior.

I think that we should document that RESTART is accepted within the set of
options, as a consequence of the set of options supported by gram.y, though.
--
Michael
Re: Some code cleanup for pgbench and pg_verifybackup
Bonjour Michaël,

My 0.02€:

> - pgbench has its own parsing routines for int64 and double, with an
> option to skip errors. That's not surprising in itself, but, for
> strtodouble(), errorOK is always true, meaning that the error strings
> are dead.

Indeed. However, there are "atof" calls for option parsing which should
rather use strtodouble, and those may want to call it with errorOK as
either true or false, depending on the case.

> For strtoint64(), errorOK is false only when parsing a Variable, where
> a second error string is generated.

ISTM that it just returns false; there is no message about the parsing
error, hence the message is generated in the function.

> I don't really think that we need to be that pedantic about the type of
> errors generated in those code paths when failing to parse a variable,
> so I'd like to propose a simplification of the code where we reuse the
> same error message as for double, cutting a bit the number of
> translatable strings.

ISTM that the point is that errors from the parser are handled differently
(by calling some "yyerror" function which does different things), so they
need a special call for that. For the other cases we would not want to have
to replicate generating an error message in each caller, so it is best done
directly in the function. Ok, currently there is only one call, but there
can be more, e.g. I have a not-yet-submitted patch to add a new option
which will need to parse an int64.

> - pgbench and pg_verifybackup make use of pg_log_fatal() to report some
> failures in code paths dedicated to command-line options, which is
> inconsistent with all the other tools that use pg_log_error().

The semantics for fatal vs error is that an error is somehow handled while
a fatal is not. If the log message is followed by a cold exit, ISTM that
fatal is the right call, and I cannot help it if other commands do not do
that. ISTM more logical to align the other commands to fatal when
appropriate.

> Thoughts?

I'd be in favor of leaving it as it is.

--
Fabien.
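To illustrate the atof point, option parsing could funnel through a small
checked helper; a hypothetical standalone sketch (parse_double_option is
invented here, and pgbench would use pg_log_error rather than fprintf):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper for option parsing: unlike atof(), this rejects
 * garbage and out-of-range input instead of silently returning 0.
 */
static double
parse_double_option(const char *optname, const char *value)
{
	char	   *end;
	double		val;

	errno = 0;
	val = strtod(value, &end);
	if (errno != 0 || end == value || *end != '\0')
	{
		fprintf(stderr, "invalid value for option %s: \"%s\"\n",
				optname, value);
		exit(1);
	}
	return val;
}

int
main(void)
{
	/* e.g. pgbench --rate takes a double; atof("oops") would yield 0 */
	double		rate = parse_double_option("--rate", "100.5");

	printf("rate = %g\n", rate);
	return 0;
}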
Re: [HACKERS] logical decoding of two-phase transactions
On Fri, Jul 23, 2021 at 8:08 PM Amit Kapila wrote:
>
> On Tue, Jul 20, 2021 at 9:24 AM Peter Smith wrote:
> >
> > Please find attached the latest patch set v98*
>
> Review comments:
> [...]
>
> With Regards,
> Amit Kapila.

Thanks for your review comments. I have been working through them today and
hope to post the v99* patches tomorrow.

--
Kind Regards,
Peter Smith.
Fujitsu Australia
Re: Incorrect usage of strtol, atoi for non-numeric junk inputs
At Mon, 26 Jul 2021 15:01:35 +0900, Michael Paquier wrote in
> On Sat, Jul 24, 2021 at 07:41:12PM +0900, Michael Paquier wrote:
> > I have looked at that over the last couple of days, and applied it
> > after some small fixes, including an indentation.
>
> One thing that we forgot here is the handling of trailing
> whitespace.  Attached is a small patch to adjust that, with one
> positive and one negative test.
>
> > The int64 and float
> > parts are extra types we could handle, but I have not looked yet at
> > how much benefit we'd get in those cases.
>
> I have looked at these two but there is really less benefit, so for
> now I am not planning to do more in this area.  For float options,
> pg_basebackup --max-rate could be one target on top of the three sets
> of options in pgbench, but it needs to handle units :(

Thanks for revising and committing! I'm fine with all of the recent
discussions on the committed part.

Though I don't think it works for "live" command line options, making the
omitting policy symmetric looks good. I see the same done in several
similar uses of strto[il].

The change in 020_pg_receivewal.pl results in a chain of four successive
failures, which is fine. But the last failure (#23) happens for a bit
dubious reason.

> Use of uninitialized value in pattern match (m//) at t/020_pg_receivewal.pl line 114.
> not ok 23 - one partial WAL segment is now completed

It might not be worth amending, but we don't need to use m// there; eq
works fine.

020_pg_receivewal.pl:114
-	is($zlib_wals[0] =~ m/$partial_wals[0]/,
+	is($zlib_wals[0] eq $partial_wals[0],

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: shared-memory based stats collector
> > Yeah, thank you very much for checking that. However, this patch is
> > now developed in Andres' GitHub repository. So I'm at a loss what to
> > do about the failure..
>
> I'll post a rebased version soon.

(Sorry if you feel hurried, which I didn't mean to.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: speed up verifying UTF-8
Attached is v20, which has a number of improvements:

1. Cleaned up and explained the DFA coding.
2. Adjusted check_ascii to return bool (now called is_valid_ascii) and to
   produce an optimized loop, using branch-free accumulators. That way, it
   doesn't need to be rewritten for different input lengths. I also think
   it's a bit easier to understand this way.
3. Put SSE helper functions in their own file.
4. Mostly-cosmetic edits to the configure detection.
5. Draft commit message.

With #2 above in place, I wanted to try different strides for the DFA, so
here are more measurements (hopefully not many more of these):

Power8, gcc 4.8

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    2944 |  1523 |   871 |    1473 |   1509

v20, 8-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1189 |   550 |   246 |     600 |    936

v20, 16-byte stride (in the actual patch):
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     981 |   440 |   134 |     791 |    820

v20, 32-byte stride:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     857 |   481 |   141 |     834 |    839

Based on the above, I decided that 16 bytes had the best overall balance.
Other platforms may differ, but I don't expect it to make a huge amount of
difference.

Just for fun, I was also a bit curious about what Vladimir mentioned
upthread about x86-64-v3 offering a different shift instruction. Somehow,
clang 12 refused to build with that target, even though the release notes
say it can, but gcc 11 was fine:

x86 Macbook, gcc 11, USE_FALLBACK_UTF8=1:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1200 |   728 |   370 |     544 |    637

v20:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     459 |   243 |    77 |     424 |    440

v20, CFLAGS="-march=x86-64-v3 -O2":
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     390 |   215 |    77 |     303 |    323

And gcc does generate the desired shift here:

objdump -S src/port/pg_utf8_fallback.o | grep shrx
  53:	c4 e2 eb f7 d1	shrxq	%rdx, %rcx, %rdx

While it looks good, clang can do about as well by simply unrolling all 16
shifts in the loop, which gcc won't do. To be clear, it's irrelevant, since
x86-64-v3 includes AVX2, and if we had that we would just use it with the
SIMD algorithm.

Macbook x86, clang 12:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     974 |   691 |   370 |     456 |    526

v20, USE_FALLBACK_UTF8=1:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     351 |   172 |    88 |     349 |    350

v20, with SSE4:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     142 |    92 |    59 |     141 |    141

I'm pretty happy with the patch at this point.

--
John Naylor
EDB: http://www.enterprisedb.com

From c82cbcf342f986396152a743a552626757b0a2b3 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Sun, 25 Jul 2021 20:41:41 -0400
Subject: [PATCH v20] Add a fast path for validating UTF-8 text

Our previous validator is a traditional one that performs comparisons and
branching one byte at a time. It's useful in that we always know exactly
how many bytes we have validated, but that precision comes at a cost.
Input validation can show up prominently in profiles of COPY FROM, and
future improvements to COPY FROM such as parallelism or line and field
parsing will put more pressure on input validation. Hence, supplement with
two fast path implementations:

On machines that support SSE4, use an algorithm described in the paper
"Validating UTF-8 In Less Than One Instruction Per Byte" by John Keiser
and Daniel Lemire. The authors have made available an open source
implementation within the simdjson library (Apache 2.0 license).

The lookup tables and naming conventions were adopted from this library,
but the code was written from scratch.

On other hardware, use a "shift-based" DFA.

Both implementations are heavily optimized for blocks of ASCII text, are
relatively free of branching and thus robust against many kinds of byte
patterns, and delay error checking to the very end. With these algorithms,
UTF-8 validation is anywhere from two to seven times faster, depending on
platform and the distribution of byte sequences in the input.

The previous coding in pg_utf8_verifystr() is retained for short strings
and for when the fast path returns an error.

Review, performance testing, and additional hacking by: Heikki
Linnakangas, Vladimir Sitnikov, Amit Khandekar, Thomas Munro, and Greg
Stark

Discussion: https://www.postgresql.org/message-id/CAFBsxsEV_SzH%2BOLyCiyon%3DiwggSyMh_eF6A3LU2tiWf3Cy2ZQg%40mail.gmail.com
---
 config/c-compiler.m4
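To give a flavor of item #2, a branch-free ASCII check accumulates the high
bit of every byte with OR and tests once at the end, so the hot loop has no
data-dependent branches. A minimal standalone sketch (not the patch's
actual is_valid_ascii):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

static bool
is_valid_ascii_sketch(const unsigned char *s, size_t len)
{
	uint64_t	highbits = 0;
	size_t		i = 0;

	/* main loop: OR together 16 bytes per iteration, branch-free on data */
	for (; i + 16 <= len; i += 16)
	{
		uint64_t	half1,
					half2;

		memcpy(&half1, s + i, sizeof(half1));
		memcpy(&half2, s + i + 8, sizeof(half2));
		highbits |= half1 | half2;
	}

	/* tail: one byte at a time */
	for (; i < len; i++)
		highbits |= s[i];

	/* any set high bit anywhere means a non-ASCII byte was seen */
	return (highbits & UINT64_C(0x8080808080808080)) == 0;
}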
Re: speed up verifying UTF-8
Just wondering, do you have the code in a GitHub/Gitlab branch?

> +	utf8_advance(s, state, len);
> +
> +	/*
> +	 * If we saw an error during the loop, let the caller handle it. We treat
> +	 * all other states as success.
> +	 */
> +	if (state == ERR)
> +		return 0;

Did you mean "state = utf8_advance(s, state, len);" there? (reassign the
state variable)

> I wanted to try different strides for the DFA

Does that (and the "len >= 32" condition) mean the patch does not improve
validation of shorter strings (the ones less than 32 bytes)? It would
probably be nice to cover them as well (e.g. with 4- or 8-byte strides).

Vladimir
Re: 2021-07 CF now in progress
On Mon, Jul 19, 2021 at 4:37 PM Ibrar Ahmed wrote:
>
> On Mon, Jul 12, 2021 at 4:59 PM Ibrar Ahmed wrote:
>>
>> Hackers,
>>
>> The Commitfest 2021-07 is now in progress. It is one of the biggest ones:
>> the total number of patches in this commitfest is 342.
>>
>> Needs review: 204.
>> Waiting on Author: 40.
>> Ready for Committer: 18.
>> Committed: 57.
>> Moved to next CF: 3.
>> Withdrawn: 15.
>> Rejected: 3.
>> Returned with Feedback: 2.
>> Total: 342.
>>
>> If you are a patch author, please check http://commitfest.cputube.org to
>> be sure your patch still applies, compiles, and passes tests.
>>
>> We need your involvement and participation in reviewing the patches.
>> Let's try and make this happen.
>>
>> --
>> Regards.
>> Ibrar Ahmed
>
> Over the past week, the statuses of 47 patches have been changed from
> "Needs review". This still leaves us with 157 patches requiring reviews.
> As always, your continuous support is appreciated to get us over the line.
>
> Please look at the patches requiring review in the current commitfest.
> Test, provide feedback where needed, and update the patch status.
>
> Total: 342.
>
> Needs review: 157.
> Waiting on Author: 74.
> Ready for Committer: 15.
> Committed: 68.
> Moved to next CF: 3.
> Withdrawn: 19.
> Rejected: 4.
> Returned with Feedback: 2.
>
> --
> Ibrar Ahmed

Over the past week, some progress was made; however, there are still 155
patches in total that require reviews. Time to continue pushing to maximize
patch reviews and get stuff committed in PostgreSQL.

Total: 342.

Needs review: 155.
Waiting on Author: 67.
Ready for Committer: 20.
Committed: 70.
Moved to next CF: 3.
Withdrawn: 20.
Rejected: 5.
Returned with Feedback: 2.

Thank you for your continued effort and support.

--
Ibrar Ahmed
Re: Skip temporary table schema name from explain-verbose output.
On Thu, 29 Apr 2021 at 08:17, Amul Sul wrote:
> On Wed, Apr 28, 2021 at 7:56 PM Tom Lane wrote:
> > I don't think we should remove them. However, it could make sense to
> > print the "pg_temp" alias instead of the real schema name when we
> > are talking about myTempNamespace. Basically try to make that alias
> > a bit less leaky.
>
> +1, let's replace it by "pg_temp" -- did the same in the attached 0001 patch.

Sounds like a good change. Surely we need a test to exercise that this
works? Otherwise ready...

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: speed up verifying UTF-8
On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov.vladi...@gmail.com> wrote:
>
> Just wondering, do you have the code in a GitHub/Gitlab branch?
>
> Did you mean state = utf8_advance(s, state, len); there? (reassign state
> variable)

Yep, that's a bug, thanks for catching it!

> > I wanted to try different strides for the DFA
>
> Does that (and "len >= 32" condition) mean the patch does not improve
> validation of the shorter strings (the ones less than 32 bytes)?

Right. Also, the 32-byte threshold was just a temporary need for testing a
32-byte stride -- testing different thresholds wouldn't hurt. I'm not
terribly concerned about short strings, though, as long as we don't
regress. That said, Heikki had something in his v14 [1] that we could use:

+/*
+ * Subroutine of pg_utf8_verifystr() to check one char. Returns the length of the
+ * character at *s in bytes, or 0 on invalid input or premature end of input.
+ *
+ * XXX: could this be combined with pg_utf8_verifychar above?
+ */
+static inline int
+pg_utf8_verify_one(const unsigned char *s, int len)

It would be easy to replace pg_utf8_verifychar with this. It might even
speed up the SQL function length_in_encoding() -- that would be a better
reason to do it.

[1] https://www.postgresql.org/message-id/2f95e70d-4623-87d4-9f24-ca534155f179%40iki.fi

--
John Naylor
EDB: http://www.enterprisedb.com
Re: speed up verifying UTF-8
On Mon, Jul 26, 2021 at 7:55 AM Vladimir Sitnikov <sitnikov.vladi...@gmail.com> wrote:
>
> Just wondering, do you have the code in a GitHub/Gitlab branch?

Sorry, I didn't see this earlier. No, I don't.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: .ready and .done files considered harmful
On Fri, Jul 23, 2021 at 5:46 PM Bossart, Nathan wrote:
> My apologies for chiming in so late to this thread, but a similar idea
> crossed my mind while working on a bug where .ready files get created
> too early [0].  Specifically, instead of maintaining a status file per
> WAL segment, I was thinking we could narrow it down to a couple of
> files to keep track of the boundaries we care about:
>
> 1. earliest_done: the oldest segment that has been archived and
>    can be recycled/removed
> 2. latest_done: the newest segment that has been archived
> 3. latest_ready: the newest segment that is ready for archival
>
> This might complicate matters for backup utilities that currently
> modify the .ready/.done files, but it would simplify this archive
> status stuff quite a bit and eliminate the need to worry about the
> directory scans in the first place.

In terms of immediate next steps, I think we should focus on eliminating
the O(n^2) problem and not get sucked into a bigger redesign. The patch on
the table aims to do just that much, and I think that's a good thing.

But in the longer term I agree that we want to redesign the signalling
somehow. I am not convinced that using a file is the right way to go. If we
had to rewrite that file for every change, and especially if we had to
fsync it, it would be almost as bad as what we're doing right now in terms
of the amount of traffic to the filesystem. Atomicity is a problem too,
because if we simply create a file then after a crash it will either exist
or not, but a file might end up garbled with a mix of old and new contents
unless we always write a temporary file and atomically rename that over the
existing one.

As I said in my original post, I'm kind of wondering about keeping the
information in shared memory instead of using the filesystem. I think we
would still need to persist it to disk at least occasionally, but perhaps
there is a way to avoid having to do that as frequently as what we do now.
I haven't thought too deeply about what the requirements are here.

--
Robert Haas
EDB: http://www.enterprisedb.com
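For reference, the temporary-file-plus-rename pattern mentioned above looks
roughly like this (a minimal POSIX sketch; PostgreSQL's own durable_rename()
additionally fsyncs the containing directory):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Atomic file update, assuming POSIX rename() semantics: readers see
 * either the old contents or the new, never a mix.
 */
static int
atomic_write_file(const char *path, const char *tmppath,
				  const char *buf, size_t len)
{
	int			fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t) len ||
		fsync(fd) != 0)			/* make the new contents durable first */
	{
		close(fd);
		return -1;
	}
	close(fd);

	/* atomically replace the old file with the new one */
	return rename(tmppath, path);
}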
Re: WIP: Relaxing the constraints on numeric scale
On Fri, 23 Jul 2021 at 16:50, Tom Lane wrote:
>
> OK, I've now studied this more closely, and have some additional
> nitpicks:
>
> * I felt the way you did the documentation was confusing.  It seems
> better to explain the normal case first, and then describe the two
> extended cases.

OK, that looks much better. Re-reading the entire section, I think it's
much clearer now.

> * As long as we're encapsulating typmod construction/extraction, let's
> also encapsulate the checks for valid typmods.

Good idea.

> * Other places are fairly careful to declare typmod values as "int32",
> so I think this code should too.

OK, that seems sensible.

> Attached is a proposed delta patch making those changes.
>
> (I made the docs mention that the extension cases are allowed as of v15.
> While useful in the short run, that will look like noise in ten years;
> so I could go either way on whether to do that.)

Hmm, yeah. In general, I find such things in the documentation useful for
quite a few years. I'm regularly looking to see when a particular feature
was added, to see if I can use it in a particular situation. But
eventually, it'll become irrelevant, and I don't know if anyone will go
around tidying these things up. I have left it in, but perhaps there is a
wider discussion to be had about whether we should be doing that more (or
less) often. FWIW, I like the way some docs include an "available since"
tag (e.g., Java's @since tag).

> If you're good with these, then I think it's ready to go.
> I'll mark it RfC in the commitfest.

Thanks. That all looked good, so I have pushed it.

Regards,
Dean
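For context, the typmod being encapsulated is a single int32; a simplified
standalone sketch of the classic NUMERIC encoding (assumed here, and it
ignores the negative-scale sign-extension handling this patch adds):

#include <stdint.h>
#include <stdio.h>

#define VARHDRSZ 4				/* as in PostgreSQL's c.h */

/* precision in the high 16 bits, scale in the low 16, offset by VARHDRSZ */
static int32_t
make_numeric_typmod(int precision, int scale)
{
	return ((precision << 16) | (scale & 0xffff)) + VARHDRSZ;
}

static int
numeric_typmod_precision(int32_t typmod)
{
	return ((typmod - VARHDRSZ) >> 16) & 0xffff;
}

static int
numeric_typmod_scale(int32_t typmod)
{
	return (typmod - VARHDRSZ) & 0xffff;
}

int
main(void)
{
	int32_t		tm = make_numeric_typmod(10, 2);	/* numeric(10,2) */

	printf("precision=%d scale=%d\n",
		   numeric_typmod_precision(tm), numeric_typmod_scale(tm));
	return 0;
}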
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jul 26, 2021 at 1:07 AM Yura Sokolov wrote:
>
> Hi,
>
> I've dreamed of writing a more compact structure for vacuum for three
> years, but life didn't give me the time to.
>
> Let me join the friendly competition.
>
> I've bet on the HATM approach: popcount-ing bitmaps for non-empty
> elements.

Thank you for proposing the new idea!

> Novelties:
> - 32 consecutive pages are stored together in a single sparse array
>   (called "chunks").
>   A chunk contains:
>   - its number,
>   - a 4-byte bitmap of non-empty pages,
>   - an array of non-empty page headers, 2 bytes each.
>     A page header contains the offset of the page's bitmap in the
>     bitmaps container.  (Except if there is just one dead tuple in a
>     page; then it is written into the header itself.)
>   - a container of concatenated bitmaps.
>
>   Ie, page metadata overhead varies from 2.4 bytes (32 pages in a single
>   chunk) to 18 bytes (1 page in a single chunk) per page.
>
> - If a page's bitmap is sparse, ie contains a lot of "all-zero" bytes,
>   it is compressed by removing the zero bytes and indexing with a
>   two-level bitmap index.
>   Two-level index - zero bytes in the first level are removed using the
>   second level. It is mostly done for 32kb pages, but let it stay since
>   it is almost free.
>
> - If a page's bitmap contains a lot of "all-one" bytes, it is inverted
>   and then encoded as sparse.
>
> - Chunks are allocated with a custom "allocator" that has no
>   per-allocation overhead. That is possible because there is no need
>   to perform "free": the allocator is freed as a whole at once.
>
> - The array of pointers to chunks is also bitmap indexed. It saves cpu
>   time when not every 32 consecutive pages have at least one dead tuple,
>   but it consumes time otherwise. Therefore an additional optimization
>   is added to quickly skip the lookup for the first non-empty run of
>   chunks.  (Ahhh, I believe this explanation is awful.)

It sounds better than my proposal.

> Andres Freund wrote 2021-07-20 02:49:
> > Hi,
> >
> > On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
> >> BTW is the implementation of the radix tree approach available
> >> somewhere? If so I'd like to experiment with that too.
> >>
> >> > I have toyed with implementing adaptively large radix nodes like
> >> > proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
> >> > gotten it quite working.
> >>
> >> That seems a promising approach.
> >
> > I've since implemented some, but not all of the ideas of that paper
> > (adaptive node sizes, but not the tree compression pieces).
> >
> > E.g. for
> >
> > select prepare(
> >     1000000, -- max block
> >     20,      -- # of dead tuples per page
> >     10,      -- dead tuples interval within a page
> >     1        -- page interval
> > );
> >
> >          attach   size    shuffled  ordered
> > array    69 ms    120 MB  84.87 s   8.66 s
> > intset   173 ms   65 MB   68.82 s   11.75 s
> > rtbm     201 ms   67 MB   11.54 s   1.35 s
> > tbm      232 ms   100 MB  8.33 s    1.26 s
> > vtbm     162 ms   58 MB   10.01 s   1.22 s
> > radix    88 ms    42 MB   11.49 s   1.67 s
> >
> > and for
> >
> > select prepare(
> >     1000000, -- max block
> >     10,      -- # of dead tuples per page
> >     1,       -- dead tuples interval within a page
> >     1        -- page interval
> > );
> >
> >          attach   size    shuffled  ordered
> > array    24 ms    60 MB   3.74 s    1.02 s
> > intset   97 ms    49 MB   3.14 s    0.75 s
> > rtbm     138 ms   36 MB   0.41 s    0.14 s
> > tbm      198 ms   101 MB  0.41 s    0.14 s
> > vtbm     118 ms   27 MB   0.39 s    0.12 s
> > radix    33 ms    10 MB   0.28 s    0.10 s
> >
> > (this is an almost unfairly good case for radix)
> >
> > Running out of time to format the results of the other testcases before
> > I have to run, unfortunately. radix uses 42MB both in test case 3 and
> > 4.
>
> My results (Ubuntu 20.04, Intel Core i7-1165G7):
>
> Test1.
>
> select prepare(1000000, 10, 20, 1); -- original
>
>          attach   size    shuffled
> array    29 ms    60 MB   93.99 s
> intset   93 ms    49 MB   80.94 s
> rtbm     171 ms   67 MB   14.05 s
> tbm      238 ms   100 MB  8.36 s
> vtbm     148 ms   59 MB   9.12 s
> radix    100 ms   42 MB   11.81 s
> svtm     75 ms    29 MB   8.90 s
>
> select prepare(1000000, 20, 10, 1); -- Andres's variant
>
>          attach   size    shuffled
> array    61 ms    120 MB  111.91 s
> intset   163 ms   66 MB   85.00 s
> rtbm     236 ms   67 MB   10.72 s
> tbm      290 ms   100 MB  8.40 s
> vtbm     190 ms   59 MB   9.28 s
> radix    117 ms   42 MB   12.00 s
> svtm     98 ms    29 MB   8.77 s
>
> Test2.
>
> select prepare(1000000, 10, 1, 1);
>
>          attach   size    shuffled
> array    31 ms    60 MB   4.68 s
> intset   97 ms    49 MB   4.03 s
> rtbm     163 ms   36 MB   0.42 s
> tbm      240 ms   100 MB  0.42 s
> vtbm     136 ms   27 MB   0.36 s
> radix    60 ms    10 MB   0.72 s
> svtm     39 ms    6 MB    0.19 s
>
> (Bad radix result probably due to smaller cache in notebook's CPU ?)
>
> Test3
>
> select prepare(1000000, 2, 100, 1);
>
>          attach   size    shuff
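To make the quoted chunk layout concrete, here is a rough C sketch (field
names are invented here; no such struct appears in any posted patch, and
the sizes simply follow the description above):

#include <stdint.h>

typedef struct DeadTupleChunk
{
	uint32_t	chunk_no;		/* which run of 32 consecutive heap pages */
	uint32_t	nonempty_pages; /* 4-byte bitmap: bit i is set if page i of
								 * the run has any dead tuples */
	uint16_t	page_headers[]; /* one 2-byte header per non-empty page:
								 * normally the offset of the page's bitmap
								 * within the shared bitmap container; if a
								 * page has a single dead tuple, its offset
								 * number is stored in the header directly */
} DeadTupleChunk;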
Re: SQL/JSON: JSON_TABLE
Below are a few small comments from a casual read-through. I noticed that
there was a new version posted after I had finished perusing, but it seems
to address other aspects.

+ Gerenates a column and inserts a composite SQL/JSON

s/Gerenates/Generates/

+ into both child and parrent columns for all missing values.

s/parrent/parent/

- objectname = "xmltable";
+ objectname = rte->tablefunc ?
+ rte->tablefunc->functype == TFT_XMLTABLE ?
+ "xmltable" : "json_table" : NULL;

In which case can rte->tablefunc be NULL for a T_TableFuncScan? Also,
nested ternary operators are confusing at best; I think this should be
rewritten as plain if statements. In general, when inspecting functype I
think it's better to spell it out with if statements rather than a ternary,
since it allows for grepping the code more easily. Having to grep for
TFT_XMLTABLE to find json_table isn't all that convenient. That also
removes the need for comments stating why a ternary operator is OK in the
first place.

+ errmsg("JSON_TABLE() is not yet implemented for json type"),

I can see this being potentially confusing to many; an errhint with a
reference to jsonb seems like a good idea.

+/* Recursively register column name in the path name list. */
+static void
+registerJsonTableColumn(JsonTableContext *cxt, char *colname)

This comment is misleading since the function isn't actually recursive,
but a helper function for a recursive function.

+    switch (get_typtype(typid))
+    {
+        case TYPTYPE_COMPOSITE:
+            return true;
+
+        case TYPTYPE_DOMAIN:
+            return typeIsComposite(getBaseType(typid));
+    }

switch statements without a default run the risk of attracting unwanted
compiler-warning attention, or making static analyzers angry. This one can
easily be rewritten with two if statements on a cached get_typtype() return
value.

+ * Returned false at the end of a scan, true otherwise.

s/Returned/Returns/ (applies in two places)

+        /* state->ordinal--; */ /* skip current outer row, reset counter */

Is this dead code to be removed, or left in there as a reminder to fix
something?

--
Daniel Gustafsson        https://vmware.com/
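For the record, the nested ternary could be spelled out along these lines
(a sketch of the suggested rewrite, not code from the patch):

/* Sketch of the suggested rewrite of the nested ternary above. */
if (rte->tablefunc == NULL)
	objectname = NULL;
else if (rte->tablefunc->functype == TFT_XMLTABLE)
	objectname = "xmltable";
else
	objectname = "json_table";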
Re: when the startup process doesn't (logging startup delays)
On Sun, Jul 25, 2021 at 1:56 PM Justin Pryzby wrote:
> On Fri, Jul 23, 2021 at 04:09:47PM +0530, Nitin Jadhav wrote:
> > > I think walkdir() should only call LogStartupProgress(FSYNC_IN_PROGRESS, path);
> > > when action == datadir_fsync_fname.
> >
> > I agree and fixed it.
>
> I saw that you fixed it by calling InitStartupProgress() after the walkdir()
> calls which do pre_sync_fname.  So then walkdir is calling
> LogStartupProgress(STARTUP_PROCESS_OP_FSYNC) even when it's not doing fsync,
> and then LogStartupProgress() is returning because !AmStartupProcess().
>
> That seems indirect, fragile, and confusing.  I suggest that walkdir() should
> take an argument for which operation to pass to LogStartupProgress().  You can
> pass a special enum for cases where nothing should be logged, like
> STARTUP_PROCESS_OP_NONE.

I don't think walkdir() has any business calling LogStartupProgress() at
all. It's supposed to be a generic function, not one that is only available
to be called from the startup process, or has different behavior there.
From my point of view, the right thing is to put the logging calls into the
particular callbacks that SyncDataDirectory() uses.

> On Wed, Jul 21, 2021 at 04:47:32PM +0530, Bharath Rupireddy wrote:
> > 1) I still don't see the need for two functions LogStartupProgress and
> > LogStartupProgressComplete. Most of the code is duplicate. I think we
> > can just do it with a single function something like [1]:
>
> I agree that one function can do this more succinctly.  I think it's best to
> use a separate enum value for START operations and END operations.

Maybe I'm missing something here, but I don't understand the purpose of
this. You can always combine two functions into one, but it's only worth
doing if you end up with less code, which doesn't seem to be the case here.
The strings are all different and that's most of the function, and the
other stuff that gets done isn't the same either, so you'd just end up with
a bunch of if-statements. That doesn't seem like an improvement.

Thinking further, I guess I understand it from the caller's perspective.
It's not necessarily clear why in some places we call LogStartupProgress()
and in others LogStartupProgressComplete(). Someone might expect a function
with "complete" in the name like that to only be called once at the very
end, rather than once at the end of a phase, and it does sort of make sense
that you'd want to call one function everywhere rather than sometimes one
and sometimes the other ... but I'm not exactly sure how to get there from
here. Having only LogStartupProgress() but having it do a giant
if-statement to figure out whether we're mid-phase or end-of-phase does not
seem like the right approach.

--
Robert Haas
EDB: http://www.enterprisedb.com
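A rough sketch of that shape, with the progress call moved into the fsync
callback that SyncDataDirectory() passes to walkdir(). LogStartupProgress()
and its enum are the patch's assumed API; the rest mirrors the shape of
fd.c's datadir_fsync_fname(), so this is illustrative only:

/*
 * Sketch: report progress from the callback, so walkdir() itself stays
 * generic and knows nothing about the startup process.
 */
static void
datadir_fsync_fname(const char *fname, bool isdir, int elevel)
{
	/* hypothetical call site for the patch's progress reporting */
	LogStartupProgress(STARTUP_PROCESS_OP_FSYNC, fname);

	/*
	 * We want to silently ignore errors here, which is what the "true"
	 * argument requests.
	 */
	fsync_fname_ext(fname, isdir, true, elevel);
}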
Re: Skip temporary table schema name from explain-verbose output.
Simon Riggs writes:
> Sounds like a good change.
> Surely we need a test to exercise this works? Otherwise ready...

There are lots of places in the core regression tests where we'd have used
a temp table, except that we needed to do EXPLAIN and the results would've
been unstable, so we used a short-lived plain table instead. Find one of
those and change it to use a temp table.

			regards, tom lane
Re: [bug?] Missed parallel safety checks, and wrong parallel safety
On Sat, Jul 24, 2021 at 5:52 AM Amit Kapila wrote:
> I think for the consistency argument how about allowing users to
> specify a parallel-safety option for both partitioned and
> non-partitioned relations but for non-partitioned relations if users
> didn't specify, it would be computed automatically? If the user has
> specified parallel-safety option for non-partitioned relation then we
> would consider that instead of computing the value by ourselves.

Having the option for both partitioned and non-partitioned tables doesn't
seem like the worst idea ever, but I am also not entirely sure that I
understand the point.

> Another reason for hesitation to do automatically for non-partitioned
> relations was the new invalidation which will invalidate the cached
> parallel-safety for all relations in relcache for a particular
> database. As mentioned by Hou-San [1], it seems we need to do this
> whenever any function's parallel-safety is changed. OTOH, changing
> parallel-safety for a function is probably not that often to matter in
> practice which is why I think you seem to be fine with this idea.

Right. I think it should be quite rare, and invalidation events are also
not crazy expensive. We can test what the worst case is, but if you have to
sit there and run ALTER FUNCTION in a tight loop to see a measurable
performance impact, it's not a real problem. There may be a code-complexity
argument against trying to figure it out automatically, perhaps, but I
don't think there's a big performance issue.

What bothers me is that if this is something people have to set manually,
then many people won't, and they will not get the benefit of the feature.
And some of them will also set it incorrectly and have problems. So I am in
favor of trying to determine it automatically where possible, to make it
easy for people. However, other people may feel differently, and I'm not
trying to say they're necessarily wrong. I'm just telling you what I think.

--
Robert Haas
EDB: http://www.enterprisedb.com
Re: Case expression pushdown
Alexander Pyhalov writes:
> [ 0001-Allow-pushing-CASE-expression-to-foreign-server-v6.patch ]

This doesn't compile cleanly:

deparse.c: In function 'foreign_expr_walker.isra.4':
deparse.c:920:8: warning: 'collation' may be used uninitialized in this function [-Wmaybe-uninitialized]
    if (collation != outer_cxt->collation)
    ^
deparse.c:914:3: warning: 'state' may be used uninitialized in this function [-Wmaybe-uninitialized]
    switch (state)
    ^~

These uninitialized variables very likely explain the fact that it fails
regression tests, both for me and for the cfbot. Even if this weren't an
outright bug, we don't tolerate code that produces warnings on common
compilers.

			regards, tom lane
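The usual shape of the fix is to give the locals well-defined initial
values at declaration, e.g. (a sketch only; the actual v7 change may
differ):

/* Sketch of the usual remedy for -Wmaybe-uninitialized in this spot. */
Oid			collation = InvalidOid;
FDWCollateState state = FDW_COLLATE_NONE;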
Re: when the startup process doesn't (logging startup delays)
On Mon, Jul 26, 2021 at 10:13:09AM -0400, Robert Haas wrote:
> I don't think walkdir() has any business calling LogStartupProgress()
> at all. It's supposed to be a generic function, not one that is only
> available to be called from the startup process, or has different
> behavior there. From my point of view, the right thing is to put the
> logging calls into the particular callbacks that SyncDataDirectory()
> uses.

You're right - this is better.

On Wed, Jul 21, 2021 at 04:47:32PM +0530, Bharath Rupireddy wrote:
> > > 1) I still don't see the need for two functions LogStartupProgress and
> > > LogStartupProgressComplete. Most of the code is duplicate. I think we
> > > can just do it with a single function something like [1]:
> >
> > I agree that one function can do this more succinctly. I think it's best to
> > use a separate enum value for START operations and END operations.
>
> Maybe I'm missing something here, but I don't understand the purpose
> of this. You can always combine two functions into one, but it's only
> worth doing if you end up with less code, which doesn't seem to be the
> case here.

 4 files changed, 39 insertions(+), 71 deletions(-)

> ... but I'm not exactly sure how to get there from here. Having only
> LogStartupProgress() but having it do a giant if-statement to figure
> out whether we're mid-phase or end-of-phase does not seem like the
> right approach.

I used a bool arg and negation to handle it within a single switch. Maybe
it's cleaner to use a separate enum value for each DONE, and set a local
done flag.

startup[29675] LOG:  database system was interrupted; last known up at 2021-07-26 10:23:04 CDT
startup[29675] LOG:  syncing data directory (fsync), elapsed time: 1.38 s, current path: ./pg_ident.conf
startup[29675] LOG:  data directory sync (fsync) complete after 1.72 s
startup[29675] LOG:  database system was not properly shut down; automatic recovery in progress
startup[29675] LOG:  resetting unlogged relations (cleanup) complete after 0.00 s
startup[29675] LOG:  redo starts at 0/17BE500
startup[29675] LOG:  redo in progress, elapsed time: 1.00 s, current LSN: 0/35D7CB8
startup[29675] LOG:  redo in progress, elapsed time: 2.00 s, current LSN: 0/54A6918
startup[29675] LOG:  redo in progress, elapsed time: 3.00 s, current LSN: 0/7370570
startup[29675] LOG:  redo in progress, elapsed time: 4.00 s, current LSN: 0/924D8A0
startup[29675] LOG:  redo done at 0/9B8 system usage: CPU: user: 4.28 s, system: 0.15 s, elapsed: 4.44 s
startup[29675] LOG:  resetting unlogged relations (init) complete after 0.03 s
startup[29675] LOG:  checkpoint starting: end-of-recovery immediate
startup[29675] LOG:  checkpoint complete: wrote 9872 buffers (60.3%); 0 WAL file(s) added, 0 removed, 8 recycled; write=0.136 s, sync=0.801 s, total=1.260 s; sync files=21, longest=0.774 s, average=B

From 9443005040be52225498d58678b5faa1668bd2ad Mon Sep 17 00:00:00 2001
From: Nitin
Date: Fri, 23 Jul 2021 15:46:56 +0530
Subject: [PATCH 1/2] Show the progress of the operations performed during
 the startup process.

The interval between each progress update is configurable, and it should be
specified in seconds.
---
 src/backend/access/transam/xlog.c             | 178 ++++++++++++++++++
 src/backend/postmaster/startup.c              |   1 +
 src/backend/storage/file/fd.c                 |  13 ++
 src/backend/storage/file/reinit.c             |  18 ++
 src/backend/utils/misc/guc.c                  |  13 +-
 src/backend/utils/misc/postgresql.conf.sample |   6 +
 src/include/access/xlog.h                     |  18 ++
 src/include/utils/timeout.h                   |   1 +
 8 files changed, 247 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3479402272..4f052050f3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -79,6 +79,7 @@
 #include "utils/relmapper.h"
 #include "utils/pg_rusage.h"
 #include "utils/snapmgr.h"
+#include "utils/timeout.h"
 #include "utils/timestamp.h"
 
 extern uint32 bootstrap_data_checksum_version;
@@ -397,6 +398,23 @@ static bool doRequestWalReceiverReply;
  */
 static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
 
+/*
+ * Start timestamp of the operations that occur during the startup process.
+ */
+static TimestampTz startupProcessOpStartTime;
+
+/*
+ * Indicates whether the startup progress interval mentioned by the user has
+ * elapsed or not. TRUE if a timeout occurred, FALSE otherwise.
+ */
+static bool logStartupProgressTimeout;
+
+/*
+ * The interval between each progress update of the operations that occur
+ * during the startup process.
+ */
+int			log_startup_progress_interval;
+
 /*----------
  * Shared-memory data structures for XLOG control
  *
@@ -7351,6 +7369,8 @@ StartupXLOG(void)
 					(errmsg("redo starts at %X/%X",
 							LSN_FORMAT_ARGS(ReadRecPtr))));
 
+			InitStartupProgress();
+
 			/*
 			 * main redo apply loop
 			 */
@@ -7358,6 +7378,8 @@
Re: Case expression pushdown
Tom Lane wrote 2021-07-26 18:18:
> Alexander Pyhalov writes:
>> [ 0001-Allow-pushing-CASE-expression-to-foreign-server-v6.patch ]
>
> This doesn't compile cleanly:
>
> deparse.c: In function 'foreign_expr_walker.isra.4':
> deparse.c:920:8: warning: 'collation' may be used uninitialized in this function [-Wmaybe-uninitialized]
>     if (collation != outer_cxt->collation)
>     ^
> deparse.c:914:3: warning: 'state' may be used uninitialized in this function [-Wmaybe-uninitialized]
>     switch (state)
>     ^~
>
> These uninitialized variables very likely explain the fact that it fails
> regression tests, both for me and for the cfbot.  Even if this weren't
> an outright bug, we don't tolerate code that produces warnings on
> common compilers.
>
> 			regards, tom lane

Hi.

Of course, this is a patch issue. I don't understand how I overlooked this.
I've rebased the patch on master and fixed it. Tests are passing here (but
they also passed for the previous patch version). Which exact tests are
failing?

--
Best regards,
Alexander Pyhalov,
Postgres Professional

From 9c9fa2e37fc62ddcd8dc6176306d74b7e219fd26 Mon Sep 17 00:00:00 2001
From: Alexander Pyhalov
Date: Thu, 22 Jul 2021 11:42:16 +0300
Subject: [PATCH] Allow pushing CASE expression to foreign server

---
 contrib/postgres_fdw/deparse.c                | 150 ++++++++++++++
 .../postgres_fdw/expected/postgres_fdw.out    | 184 ++++++++++++++++++
 contrib/postgres_fdw/sql/postgres_fdw.sql     |  63 ++++++
 3 files changed, 397 insertions(+)

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 31919fda8c6..df1aaf8e713 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -186,6 +186,8 @@ static void appendFunctionName(Oid funcid, deparse_expr_cxt *context);
 static Node *deparseSortGroupClause(Index ref, List *tlist, bool force_colno,
 									deparse_expr_cxt *context);
 
+static void deparseCaseExpr(CaseExpr *node, deparse_expr_cxt *context);
+
 /*
  * Helper functions
  */
@@ -627,6 +629,105 @@ foreign_expr_walker(Node *node,
 					state = FDW_COLLATE_NONE;
 			}
 			break;
+		case T_CaseExpr:
+			{
+				ListCell   *lc;
+				foreign_loc_cxt tmp_cxt;
+				CaseExpr   *ce = (CaseExpr *) node;
+
+				/*
+				 * case arg expression collation doesn't affect result
+				 * collation
+				 */
+				tmp_cxt.collation = InvalidOid;
+				tmp_cxt.state = FDW_COLLATE_NONE;
+				if (ce->arg && !foreign_expr_walker((Node *) ce->arg, glob_cxt, &tmp_cxt))
+					return false;
+
+				/* Recurse to case clause subexpressions. */
+				foreach(lc, ce->args)
+				{
+					CaseWhen   *cw = lfirst_node(CaseWhen, lc);
+					Node	   *whenExpr = (Node *) cw->expr;
+
+					/*
+					 * The parser should have produced WHEN clauses of the
+					 * form "CaseTestExpr = RHS", possibly with an implicit
+					 * coercion inserted above the CaseTestExpr. However in an
+					 * expression that's been through the optimizer, the WHEN
+					 * clause could be almost anything (since the equality
+					 * operator could have been expanded into an inline
+					 * function). In this case forbid pushdown.
+					 */
+					if (ce->arg)
+					{
+						List	   *whenExprArgs;
+
+						if (!IsA(whenExpr, OpExpr))
+							return false;
+
+						whenExprArgs = ((OpExpr *) whenExpr)->args;
+
+						if ((list_length(whenExprArgs) != 2) ||
+							!IsA(strip_implicit_coercions(linitial(whenExprArgs)), CaseTestExpr))
+							return false;
+					}
+
+					/*
+					 * case when expression collation doesn't affect result
+					 * collation
+					 */
+					tmp_cxt.collation = InvalidOid;
+					tmp_cxt.state = FDW_COLLATE_NONE;
+					/* Recurse to case clause expression. */
+					if (!foreign_expr_walker((Node *) cw->expr,
+											 glob_cxt, &tmp_cxt))
+						return false;
+
+					/* Recurse to result expression. */
+					if (!foreign_expr_walker((Node *) cw->result,
+											 glob_cxt, &inner_cxt))
+						return false;
+				}
+
+				if (!foreign_expr_walker((Node *) ce->defresult, glob_cxt, &inner_cxt))
+					return false;
+
+				/*
+				 * Collation rule is same as for function nodes.
+				 */
+				collation = ce->casecollid;
+				if (collation == InvalidOid)
+					state = FDW_COLLATE_NONE;
+				else if (inner_cxt.state == FDW_COLLATE_SAFE &&
+						 collation == inner_cxt.collation)
+					state = FDW_COLLATE_SAFE;
+				else if (collation == DEFAULT_COLLATION_OID)
+					state = FDW_COLLATE_NONE;
+				else
+					state = FDW_COLLATE_UNSAFE;
+			}
+			break;
+		case T_CaseTestExpr:
+			{
+				CaseTestExpr *c = (CaseTestExpr *) node;
+
+				/*
+				 * Collation rule is same as for function nodes.
+				 */
+				collation = c->collation;
+				if (collation == InvalidOid)
+					state = FDW_COLLATE_NONE;
+				else if (inner_cxt.state == FDW_COLLATE_SAFE &&
+						 collation == inner_cxt.collation)
+					state = FDW_COLLATE_SAFE;
+				else if (collation == DEFAULT_COLLATION_OID)
+					state = FDW_COLLATE_NONE;
+				else
+					state = FDW_COLLATE_UNSAFE;
+			}
+			break;
		c
Re: Inaccurate error message when set fdw batch_size to 0
On 2021/07/26 13:56, Bharath Rupireddy wrote:
> On Thu, Jul 15, 2021 at 7:54 PM Bharath Rupireddy wrote:
>> On Mon, Jul 12, 2021 at 10:11 PM Bharath Rupireddy wrote:
>>> On Mon, Jul 12, 2021 at 9:20 PM Fujii Masao wrote:
>>>> + Avoid Using non-negative Word in Error Messages
>>>> +
>>>> + Do not use non-negative word in error messages as it looks
>>>> + ambiguous. Instead, use foo must be an integer value greater than zero
>>>> + or foo must be an integer value greater than or equal to zero
>>>> + if option foo expects an integer value.
>>>>
>>>> It seems suitable to put this guide under "Tricky Words to Avoid"
>>>> rather than adding it as a separate section. Thoughts?
>>>
>>> +1. I will change.
>>
>> PSA v7 patch with the above change.
>
> PSA v8 patch rebased onto the latest master.

Thanks for updating the patch!

+ non-negative
+
+ Do not use non-negative word in error messages as it looks
+ ambiguous. Instead, use foo must be an integer value greater than
+ zero or foo must be an integer value greater than or equal
+ to zero if option foo expects an integer value.

This description looks a bit redundant. And IMO it's better to also
document how "non-negative" is ambiguous. So what about the following
description instead? I replaced the description with it; patch attached.
I also uppercased the first character "n" of "non-negative" in the title
for the sake of consistency with the other items.

+ Non-negative
+
+ Avoid non-negative as it is ambiguous
+ about whether it accepts zero. It's better to use
+ greater than zero or
+ greater than or equal to zero.

- /* Number of workers should be non-negative. */
+ /* Number of parallel workers should be greater than zero. */
  Assert(nworkers >= 0);

Shouldn't this be "greater than or equal to zero", instead? Anyway, since
this is a comment, not an error message, and there are still other comments
using "non-negative", I don't think we need to change only this comment for
now. So I excluded this change from the patch. Maybe we can get rid of all
"non-negative" from comments and documents later *if* necessary.

- errmsg("repeat count size must be a non-negative integer")));
+ errmsg("repeat count size must be greater than or equal to zero")));

- errmsg("number of workers must be a non-negative integer")));
+ errmsg("number of workers must be greater than or equal to zero")));

Isn't it better to replace "be greater" with "be an integer value greater"?
I applied this to the patch.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 4593cbc540..c574ca2cf3 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -119,7 +119,10 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 	else if (strcmp(def->defname, "fdw_startup_cost") == 0 ||
 			 strcmp(def->defname, "fdw_tuple_cost") == 0)
 	{
-		/* these must have a non-negative numeric value */
+		/*
+		 * These must have a floating point value greater than or equal to
+		 * zero.
+		 */
 		char	   *value;
 		double		real_val;
 		bool		is_parsed;
@@ -136,7 +139,7 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		if (real_val < 0)
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-					 errmsg("\"%s\" requires a non-negative floating point value",
+					 errmsg("\"%s\" must be a floating point value greater than or equal to zero",
 							def->defname)));
 	}
 	else if (strcmp(def->defname, "extensions") == 0)
@@ -163,7 +166,7 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		if (int_val <= 0)
 			ereport(ERROR,
 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
-					 errmsg("\"%s\" requires a non-negative integer value",
+					 errmsg("\"%s\" must be an integer value greater than zero",
 							def->defname)));
 	}
 	else if (strcmp(def->defname, "password_required") == 0)
diff --git a/doc/src/sgml/sources.sgml b/doc/src/sgml/sources.sgml
index 3f2c40b750..e6ae02f2af 100644
--- a/do
needless complexity in StartupXLOG
StartupXLOG() has code beginning around line 7900 of xlog.c that decides,
at the end of recovery, between four possible courses of action. It either
writes an end-of-recovery record, or requests a checkpoint, or creates a
checkpoint, or does nothing, depending on the value of 3 flag variables,
and also on whether we're still able to read the last checkpoint record:

        checkPointLoc = ControlFile->checkPoint;

        /*
         * Confirm the last checkpoint is available for us to recover
         * from if we fail.
         */
        record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
        if (record != NULL)
        {
            promoted = true;

It seems to me that ReadCheckpointRecord() should never fail here. It
should always be true, even when we're not in recovery, that the last
checkpoint record is readable. If there's ever a situation where that is
not true, even for an instant, then a crash at that point will be
unrecoverable. Now, one way that it could happen is if we've got a bug in
the code someplace that removes WAL segments too soon. However, if we have
such a bug, we should fix it. What this code does is say "oh, I guess we
removed the old checkpoint too soon, no big deal, let's just be more
aggressive about getting the next one done," which I do not think is the
right response. Aside from a bug, the only other way I can see it happening
is if someone is manually removing WAL segments as the server is running
through recovery, perhaps as some last-ditch play to avoid running out of
disk space. I don't think the server needs to have - or should have - code
to cater to such crazy scenarios. Therefore I think that this check, at
least in its current form, is not sensible.

My first thought was that we should do the check unconditionally, rather
than just when bgwriterLaunched && LocalPromoteIsTriggered, and ERROR if it
fails. But then I wondered what the point of that would be exactly. If we
have such a bug -- and to the best of my knowledge there's no evidence that
we do -- there's no particular reason it should only happen at the end of
recovery. It could happen any time the system -- or the user, or malicious
space aliens -- removes files from pg_wal, and we have no real idea about
the timing of malicious space alien activity, so doing the test here rather
than anywhere else just seems like a shot in the dark. Perhaps the most
logical place would be to move it to CreateCheckPoint() just after we
remove old xlog files, but we don't have the xlogreader there, and anyway I
don't see how it's really going to help. What bug do we expect to catch by
removing files we think we don't need and then checking that we didn't
remove the files we think we do need? That seems more like grasping at
straws than a serious attempt to make things work better.

So at the moment I am leaning toward the view that we should just remove
this check entirely, as in the attached proposed patch.

Really, I think we should consider going further. If it's safe to write an
end-of-recovery record rather than a checkpoint, why not do so all the
time? Right now we fail to do that in the above-described "impossible"
scenario where the previous checkpoint record can't be read, or if we're
exiting archive recovery for some reason other than a promotion request, or
if we're in single-user mode, or if we're in crash recovery. Presumably,
people would like to start up the server quickly in all of those scenarios,
so the only reason not to use this technology all the time is if we think
it's safe in some scenarios and not others.

I can't think of a reason why it matters why we're exiting archive
recovery, nor can I think of a reason why it matters whether we're in
single-user mode. The distinction between crash recovery and archive
recovery does seem to matter, but if anything the crash recovery scenario
seems simpler, because then there's only one timeline involved.

I realize that conservatism may have played a role in this code ending up
looking the way that it does; someone seems to have thought it would be
better not to rely on a new idea in all cases. From my point of view,
though, it's scary to have so many cases, especially cases that don't seem
like they should ever be reached. I think that simplifying the logic here
and trying to do the same things in as many cases as we can will lead to
better robustness.

Imagine if instead of all the hairy logic we have now we just replaced this
whole "if (InRecovery)" stanza with this:

    if (InRecovery)
        CreateEndOfRecoveryRecord();

That would be WAY easier to reason about than the rat's nest we have here
today. Now, I am not sure what it would take to get there, but I think that
is the direction we ought to be heading.

--
Robert Haas
EDB: http://www.enterprisedb.com

0001-Remove-pointless-call-to-ReadCheckpointRecord.patch
Re: .ready and .done files considered harmful
On 7/26/21, 6:31 AM, "Robert Haas" wrote:
> In terms of immediate next steps, I think we should focus on
> eliminating the O(n^2) problem and not get sucked into a bigger
> redesign. The patch on the table aims to do just that much and I think
> that's a good thing.

I agree. I'll leave further discussion about a redesign for another thread.

Nathan
Re: Skip temporary table schema name from explain-verbose output.
I wrote: > Simon Riggs writes: >> Surely we need a test to exercise this works? Otherwise ready... > There are lots of places in the core regression tests where we'd have > used a temp table, except that we needed to do EXPLAIN and the results > would've been unstable, so we used a short-lived plain table instead. > Find one of those and change it to use a temp table. Hmm ... I looked through the core regression tests' usages of EXPLAIN VERBOSE and didn't really find any that it seemed to make sense to change that way. I guess we've been more effective at programming around that restriction than I thought. Anyway, after looking at the 0001 patch, I think there's a pretty large oversight in that it doesn't touch ruleutils.c, although EXPLAIN relies heavily on that to print expressions and suchlike. We could account for that as in the attached revision of 0001. However, I wonder whether this isn't going in the wrong direction. Instead of piecemeal s/get_namespace_name/get_namespace_name_or_temp/, we should consider just putting this behavior right into get_namespace_name, and dropping the separate get_namespace_name_or_temp function. I can't really see any situation in which it's important to report the exact schema name of our own temp schema. On the other hand, I don't like 0002 one bit, because it's not accounting for whether the temp schema it's mangling is *our own* temp schema or some other session's. I do not think it is wise or even safe to report some other temp schema as being "pg_temp". By the same token, I wonder whether this bit in event_trigger.c is a good idea or a safety hazard: /* XXX not quite get_namespace_name_or_temp */ if (isAnyTempNamespace(schema_oid)) schema = pstrdup("pg_temp"); else schema = get_namespace_name(schema_oid); Alvaro, you seem to be responsible for both the existence of the separate get_namespace_name_or_temp function and the fact that it's being avoided here. I wonder what you think about this. 
regards, tom lane

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index f15c97ad7a..51fac77f3d 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2854,7 +2854,7 @@ postgresExplainForeignScan(ForeignScanState *node, ExplainState *es)
 		{
 			char	   *namespace;
 
-			namespace = get_namespace_name(get_rel_namespace(rte->relid));
+			namespace = get_namespace_name_or_temp(get_rel_namespace(rte->relid));
 			appendStringInfo(relations, "%s.%s",
 							 quote_identifier(namespace),
 							 quote_identifier(relname));
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 340db2bac4..36fbe129cf 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3747,7 +3747,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 			Assert(rte->rtekind == RTE_RELATION);
 			objectname = get_rel_name(rte->relid);
 			if (es->verbose)
-				namespace = get_namespace_name(get_rel_namespace(rte->relid));
+				namespace = get_namespace_name_or_temp(get_rel_namespace(rte->relid));
 			objecttag = "Relation Name";
 			break;
 		case T_FunctionScan:
@@ -3774,8 +3774,7 @@ ExplainTargetRel(Plan *plan, Index rti, ExplainState *es)
 					objectname = get_func_name(funcid);
 					if (es->verbose)
-						namespace =
-							get_namespace_name(get_func_namespace(funcid));
+						namespace = get_namespace_name_or_temp(get_func_namespace(funcid));
 				}
 			}
 			objecttag = "Function Name";
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 5e7108f903..4df8cc5abf 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -1617,7 +1617,7 @@ pg_get_statisticsobj_worker(Oid statextid, bool columns_only, bool missing_ok)
 	if (!columns_only)
 	{
-		nsp = get_namespace_name(statextrec->stxnamespace);
+		nsp = get_namespace_name_or_temp(statextrec->stxnamespace);
 		appendStringInfo(&buf, "CREATE STATISTICS %s",
 						 quote_qualified_identifier(nsp,
 													NameStr(statextrec->stxname)));
@@ -2811,7 +2811,7 @@ pg_get_functiondef(PG_FUNCTION_ARGS)
 	 * We always qualify the function name, to ensure the right function gets
 	 * replaced.
 	 */
-	nsp = get_namespace_name(proc->pronamespace);
+	nsp = get_namespace_name_or_temp(proc->pronamespace);
 	appendStringInfo(&buf, "CREATE OR REPLACE %s %s(",
 					 isfunction ? "FUNCTION" : "PROCEDURE",
 					 quote_qualified_identifier(nsp, name));
@@ -11183,7 +11183,7 @@ get_opclass_name(Oid opclass, Oid actual_datatype,
 			appendStringInfo(buf, " %s", quote_identifier(opcname));
 		else
 		{
-			nspname = get_namespace_name(opcrec->opcnamespace);
+			nspname = get_namespace_name_or_temp(opcrec->opcnamespace);
 			appendStringInfo(buf, " %s.%s
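For concreteness, a minimal sketch of what folding the substitution directly into get_namespace_name() could look like, so that the separate get_namespace_name_or_temp() goes away entirely. This is only a sketch of the idea under discussion, not the attached patch; note that isTempNamespace() is true only for our own session's temp schema, so other backends' temp schemas would keep their real names:

    char *
    get_namespace_name(Oid nspid)
    {
        HeapTuple	tp;

        /* hypothetical addition: report our own temp schema as "pg_temp" */
        if (isTempNamespace(nspid))
            return pstrdup("pg_temp");

        tp = SearchSysCache1(NAMESPACEOID, ObjectIdGetDatum(nspid));
        if (HeapTupleIsValid(tp))
        {
            Form_pg_namespace nsptup = (Form_pg_namespace) GETSTRUCT(tp);
            char	   *result = pstrdup(NameStr(nsptup->nspname));

            ReleaseSysCache(tp);
            return result;
        }

        return NULL;
    }

(As the follow-ups below show, this wholesale approach turns out to have user-visible fallout for regnamespace output.)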
Re: Skip temporary table schema name from explain-verbose output.
On Mon, 26 Jul 2021 at 17:49, Tom Lane wrote:
>
> I wrote:
> > Simon Riggs writes:
> >> Surely we need a test to exercise this works? Otherwise ready...
>
> > There are lots of places in the core regression tests where we'd have
> > used a temp table, except that we needed to do EXPLAIN and the results
> > would've been unstable, so we used a short-lived plain table instead.
> > Find one of those and change it to use a temp table.
>
> Hmm ... I looked through the core regression tests' usages of EXPLAIN
> VERBOSE and didn't really find any that it seemed to make sense to change
> that way. I guess we've been more effective at programming around that
> restriction than I thought.
>
> Anyway, after looking at the 0001 patch, I think there's a pretty large
> oversight in that it doesn't touch ruleutils.c, although EXPLAIN relies
> heavily on that to print expressions and suchlike. We could account
> for that as in the attached revision of 0001.
>
> However, I wonder whether this isn't going in the wrong direction.
> Instead of piecemeal s/get_namespace_name/get_namespace_name_or_temp/,
> we should consider just putting this behavior right into
> get_namespace_name, and dropping the separate get_namespace_name_or_temp
> function. I can't really see any situation in which it's important
> to report the exact schema name of our own temp schema.

That sounds much better because any wholesale change like that will affect 100s of places in extensions and it would be easier if we made just one change in core.

--
Simon Riggs
http://www.EnterpriseDB.com/
Re: when the startup process doesn't (logging startup delays)
On Mon, Jul 26, 2021 at 11:30 AM Justin Pryzby wrote:
> > Maybe I'm missing something here, but I don't understand the purpose
> > of this. You can always combine two functions into one, but it's only
> > worth doing if you end up with less code, which doesn't seem to be the
> > case here.
>
> 4 files changed, 39 insertions(+), 71 deletions(-)

Hmm. I don't completely hate that, but I don't love it either. I don't think the result is any easier to understand than the original, and in some ways it's worse. In particular, you've flattened the separate comments for each of the individual functions into a single-line comment that is more generic than the comments it replaced. You could fix that and you'd still be slightly ahead on LOC, but I'm not convinced that it's actually a real win.

> > ... but I'm not exactly sure how to get there from here. Having only
> > LogStartupProgress() but having it do a giant if-statement to figure
> > out whether we're mid-phase or end-of-phase does not seem like the
> > right approach.
>
> I used a bool arg and negation to handle within a single switch. Maybe it's
> cleaner to use a separate enum value for each DONE, and set a local done flag.

If we're going to go the route of combining the functions, I agree that a Boolean is the way to go; I think we have some existing precedent for 'bool finished' rather than 'bool done'. But I kind of wonder if we should have an enum in the first place. It feels like we've got code in a bunch of places that just exists to decide which enum value to use, and then code someplace else that turns around and decides which message to produce. That would be sensible if we were using the same enum values in lots of places, but that's not the case here. So suppose we just moved the messages to the places where we're now calling LogStartupProgress() or LogStartupProgressComplete()? So something like this:

    if (should_report_startup_progress())
        ereport(LOG,
                (errmsg("syncing data directory (syncfs), elapsed time: %ld.%02d s, current path: %s",
                        secs, (usecs / 10000), path)));

Well, that doesn't quite work, because "secs" and "usecs" aren't going to exist in the caller. We can fix that easily enough: let's just make should_report_startup_progress take arguments long *secs, int *usecs, and bool done. Then the above becomes:

    if (should_report_startup_progress(&secs, &usecs, false))
        ereport(LOG,
                (errmsg("syncing data directory (syncfs), elapsed time: %ld.%02d s, current path: %s",
                        secs, (usecs / 10000), path)));

And if this were the call site that corresponds to LogStartupProgressComplete(), we'd replace false with true. Now, the only real disadvantage of this that I can see is that it requires the caller to declare 'secs' and 'usecs', which is not a big deal, but mildly annoying perhaps. I think we can do better still with a little macro magic. Suppose we define a macro report_startup_progress(force, msg, ...) that expands to:

    {
        long    secs;
        int     usecs;

        if (startup_progress_timer_expired(&secs, &usecs, force))
            ereport(LOG, errmsg(msg, secs, usecs, ...));
    }

Then the above just becomes this:

    report_startup_progress(false, "syncing data directory (syncfs), elapsed time: %ld.%02d s, current path: %s", path);

This would have the advantage that the strings we're using would be present in the code that arranges to emit them, instead of being removed to some other module, so I think it would be clearer. It would also have the advantage of making it much easier to add further calls, if someone feels they want to do that.
You don't have to run around and update enums and all the various things that use them, just copy and paste the line above and adjust as required. With this design, we avoid a lot of "action at a distance". We don't define the message strings in a place far-removed from the code that wants to emit them any more. When someone wants a new progress message, they can just add another call to report_startup_progress() wherever it needs to go and they're done. They don't have to go run and update the enum and various switch statements. They're just done.

-- Robert Haas EDB: http://www.enterprisedb.com
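A minimal sketch of the macro shape being proposed, assuming a hypothetical startup_progress_timer_expired() with the signature described above; the do/while wrapper, the GNU-style ##__VA_ARGS__ comma handling, and dividing usecs down to hundredths for the %02d format are all assumptions layered on the sketch in the message:

    #define report_startup_progress(force, msg, ...) \
        do { \
            long        secs; \
            int         usecs; \
            \
            if (startup_progress_timer_expired(&secs, &usecs, (force))) \
                ereport(LOG, \
                        errmsg(msg, secs, (usecs / 10000), ##__VA_ARGS__)); \
        } while (0)

The do/while wrapper keeps the macro usable as a single statement, e.g. inside an unbraced if, which the bare block in the message's sketch would not be.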
Re: Rename of triggers for partitioned tables
On 2021-Jul-25, Tom Lane wrote: > Perhaps there's no actual bug there, but it's still horrible coding. > For one thing, the implication that found could be negative is extremely > confusing to readers. A boolean might be better. However, I wonder why > you bothered with a flag in the first place. The usual convention if > we know there can be only one match is to just not write a loop at all, > with a suitable comment, like this pre-existing example elsewhere in > trigger.c: > > /* There should be at most one matching tuple */ > if (HeapTupleIsValid(tuple = systable_getnext(tgscan))) > > If you're not quite convinced there can be only one match, then it > still shouldn't be an Assert --- a real test-and-elog would be better. I agree that coding there was dubious. I've removed the flag and assert. Arne complained that there should be a unique constraint on (tgrelid, tgparentid) which would sidestep the need for this to be a loop. I don't think it's really necessary, and I'm not sure how to create a system index WHERE tgparentid <> 0. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "Saca el libro que tu religión considere como el indicado para encontrar la oración que traiga paz a tu alma. Luego rebootea el computador y ve si funciona" (Carlos Duclós)
Re: Avoiding data loss with synchronous replication
On 7/24/21, 5:25 PM, "Andres Freund" wrote: > First, from the user experience side of things, the issue is that this seems > to propose violating read-your-own-writes. Within a single connection to a > single node. Which imo is *far* worse than seeing writes that haven't yet been > acknowledged as replicated after a query cancel. Right. I suspect others will have a similar opinion. Nathan
Re: Skip temporary table schema name from explain-verbose output.
On Mon, Jul 26, 2021 at 12:49 PM Tom Lane wrote: > I can't really see any situation in which it's important > to report the exact schema name of our own temp schema. It would actually be nice if there were some easy way of getting that for the rare situations in which there are problems. For example, if the catalog entries get corrupted and you can't access some table in your pg_temp schema, you might like to know which pg_temp schema you've got so that you can be sure to examine the right catalog entries to fix the problem or understand the problem or whatever you are trying to do. I don't much care exactly how we go about making that information available and I agree that showing pg_temp_NNN in EXPLAIN output is worse than just pg_temp. I'm just saying that concealing too thoroughly what is actually happening can be a problem in the rare instance where troubleshooting is required. -- Robert Haas EDB: http://www.enterprisedb.com
Re: badly calculated width of emoji in psql
On Fri, 2021-07-23 at 17:42 +0200, Pavel Stehule wrote:
> On Thu, Jul 22, 2021 at 0:12, Jacob Champion wrote:
> >
> > Pavel, I'd be interested to see what your benchmarks find with this
> > code. Does this fix the original issue for you?
>
> I can confirm that the original issue is fixed.

Great!

> I tested performance
>
> I had three data sets
>
> 1. typical data - mix ascii and utf characters typical for czech
> language - 25K lines - there is very small slowdown 2ms from 24 to
> 26ms (stored file of this result has 3MB)
>
> 2. the worst case - this reports has only emoji 1000 chars * 10K rows
> - there is more significant slowdown - from 160 ms to 220 ms (stored
> file has 39MB)

I assume the stored file size has grown with this patch, since we're now printing the correct number of spacing/border characters?

> 3. a little bit of obscure datasets generated by \x and select * from
> pg_proc - it has 99K lines - and there are a lot of unicode
> decorations (borders). The line has 17K chars. (the stored file has
> 1.7GB)
> In this dataset I see a slowdown from 4300 to 4700 ms.

These results are lining up fairly well with my profiling. To isolate the effects of the new algorithm (as opposed to printing time) I redirected to /dev/null:

    psql postgres -c "select repeat(unistr('\u115D'), 1000);" > /dev/null

This is what I expect to be the worst case for the new patch: a huge string consisting of nothing but \u115D, which is in the first interval and therefore takes the maximum number of iterations during the binary search. For that command, the wall time slowdown with this patch was about 240ms (from 1.128s to 1.366s, or 21% slower). Callgrind shows an increase of 18% in the number of instructions executed with the interval table patch, all of it coming from PQdsplen (no surprise). PQdsplen itself has a 36% increase in instruction count for that run.

I also did a microbenchmark of PQdsplen (code attached, requires Google Benchmark [1]). The three cases I tested were standard ASCII characters, a smiley-face emoji, and the worst-case \u115F character.

Without the patch:

    Benchmark            Time        CPU   Iterations
    ...
    BM_Ascii_mean      4.97 ns    4.97 ns            5
    BM_Ascii_median    4.97 ns    4.97 ns            5
    BM_Ascii_stddev   0.035 ns   0.035 ns            5
    ...
    BM_Emoji_mean      6.30 ns    6.30 ns            5
    BM_Emoji_median    6.30 ns    6.30 ns            5
    BM_Emoji_stddev   0.045 ns   0.045 ns            5
    ...
    BM_Worst_mean      12.4 ns    12.4 ns            5
    BM_Worst_median    12.4 ns    12.4 ns            5
    BM_Worst_stddev   0.038 ns   0.038 ns            5

With the patch:

    Benchmark            Time        CPU   Iterations
    ...
    BM_Ascii_mean      4.59 ns    4.59 ns            5
    BM_Ascii_median    4.60 ns    4.60 ns            5
    BM_Ascii_stddev   0.069 ns   0.068 ns            5
    ...
    BM_Emoji_mean      11.8 ns    11.8 ns            5
    BM_Emoji_median    11.8 ns    11.8 ns            5
    BM_Emoji_stddev   0.059 ns   0.059 ns            5
    ...
    BM_Worst_mean      18.5 ns    18.5 ns            5
    BM_Worst_median    18.5 ns    18.5 ns            5
    BM_Worst_stddev   0.077 ns   0.077 ns            5

So an incredibly tiny improvement in the ASCII case, which is reproducible across multiple runs and not just a fluke (I assume because the code is smaller now and has better cache line characteristics?). A ~90% slowdown for the emoji case, and a ~50% slowdown for the worst-performing characters. That seems perfectly reasonable considering we're talking about dozens of nanoseconds.

> 9% looks too high, but in absolute time it is 400ms for 99K lines and
> very untypical data, or 2ms for more typical results., 2ms are
> nothing (for interactive work). More - this is from a pspg
> perspective.
> In psql there can be overhead of network, protocol
> processing, formatting, and more and more, and psql doesn't need to
> calculate display width of decorations (borders), what is the reason
> for slowdowns in pspg.

Yeah. Considering the alignment code is for user display, the absolute performance is going to dominate, and I don't see any red flags so far. If you're regularly dealing with unbelievably huge amounts of emoji, I think the amount of extra time we're seeing here is unlikely to be a problem. If it is, you can always turn alignment off. (Do you rely on horizontal alignment for lines with millions of characters, anyway?)

Laurenz, Michael, what do you think?

--Jacob

[1] https://github.com/google/benchmark

/*
 * Copyright 2021 VMware, Inc.
 * SP
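The shape of the lookup being benchmarked here is roughly the following: a binary search over a sorted table of non-overlapping [first, last] codepoint ranges. This is only an illustrative sketch (names and structure are assumptions, not the patch itself):

    struct mbinterval
    {
        unsigned int first;
        unsigned int last;
    };

    /*
     * Return 1 if ucs falls inside any interval of the table, else 0.
     * "max" is the highest valid index into the table.
     */
    static int
    mbbisearch(unsigned int ucs, const struct mbinterval *table, int max)
    {
        int         min = 0;

        if (ucs < table[0].first || ucs > table[max].last)
            return 0;

        while (max >= min)
        {
            int         mid = (min + max) / 2;

            if (ucs > table[mid].last)
                min = mid + 1;
            else if (ucs < table[mid].first)
                max = mid - 1;
            else
                return 1;
        }

        return 0;
    }

A codepoint that lands in the very first interval still costs the full log2(N) probes, since the search only narrows toward the left edge one halving at a time, which is why such characters make a good worst case for the microbenchmark.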
Re: Skip temporary table schema name from explain-verbose output.
On 2021-Jul-26, Tom Lane wrote: > Alvaro, you seem to be responsible for both the existence of the separate > get_namespace_name_or_temp function and the fact that it's being avoided > here. I wonder what you think about this. The reason I didn't touch get_namespace_name then (e9a077cad379) was that I didn't want to change the user-visible behavior for any existing features; I was just after a way to implement dropped-object DDL trigger tracking. If we agree that displaying pg_temp instead of pg_temp_XXX everywhere is an improvement, then I don't see a reason not to change how get_namespace_name works and get rid of get_namespace_name_or_temp. I don't see much usefulness in displaying the exact name of the temp namespace anywhere, particularly since using "pg_temp" as a qualification in queries already refers to the current backend's temp namespace. Trying to refer to it by exact name in SQL may lead to affecting some other backend's temp objects ... -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/
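To illustrate that last point with a quick example (behavior as observed on an unpatched server; the NNN suffix varies per backend):

    CREATE TEMP TABLE t (x int);

    -- "pg_temp" always resolves to *our own* temp schema:
    SELECT * FROM pg_temp.t;

    -- while the catalogs show the real, backend-specific name:
    SELECT pg_my_temp_schema()::regnamespace;   -- e.g. pg_temp_3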
Re: needless complexity in StartupXLOG
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> So at the moment I am leaning toward the view that we should just
> remove this check entirely, as in the attached, proposed patch.

Haven't dug in deeply but at least following your explanation and reading over the patch and the code a bit, I tend to agree.

> Really, I think we should consider going further. If it's safe to
> write an end-of-recovery record rather than a checkpoint, why not do
> so all the time? Right now we fail to do that in the above-described
> "impossible" scenario where the previous checkpoint record can't be
> read, or if we're exiting archive recovery for some reason other than
> a promotion request, or if we're in single-user mode, or if we're in
> crash recovery. Presumably, people would like to start up the server
> quickly in all of those scenarios, so the only reason not to use this
> technology all the time is if we think it's safe in some scenarios and
> not others. I can't think of a reason why it matters why we're exiting
> archive recovery, nor can I think of a reason why it matters whether
> we're in single user mode. The distinction between crash recovery and
> archive recovery does seem to matter, but if anything the crash
> recovery scenario seems simpler, because then there's only one
> timeline involved.

Yeah, tend to agree with this too ... but something I find a bit curious is the comment:

    * Insert a special WAL record to mark the end of
    * recovery, since we aren't doing a checkpoint.

... immediately after setting promoted = true, and then at the end of StartupXLOG() having:

    if (promoted)
        RequestCheckpoint(CHECKPOINT_FORCE);

maybe I'm missing something, but seems like that comment isn't being terribly clear. Perhaps we aren't doing a full checkpoint *there*, but sure looks like we're going to do one moments later regardless of anything else since we've set promoted to true..?

> I realize that conservatism may have played a role in this code ending
> up looking the way that it does; someone seems to have thought it
> would be better not to rely on a new idea in all cases. From my point
> of view, though, it's scary to have so many cases, especially cases
> that don't seem like they should ever be reached. I think that
> simplifying the logic here and trying to do the same things in as many
> cases as we can will lead to better robustness. Imagine if instead of
> all the hairy logic we have now we just replaced this whole if
> (InRecovery) stanza with this:
>
>     if (InRecovery)
>         CreateEndOfRecoveryRecord();
>
> That would be WAY easier to reason about than the rat's nest we have
> here today. Now, I am not sure what it would take to get there, but I
> think that is the direction we ought to be heading.

Agreed that simpler logic is better, provided it's correct logic, of course. Finding better ways to test all of this would be really nice.

Thanks!

Stephen

signature.asc Description: PGP signature
Re: Use WaitLatch for {pre, post}_auth_delay instead of pg_usleep
On 7/24/21, 9:16 AM, "Bharath Rupireddy" wrote:
> On Fri, Jul 23, 2021 at 4:40 AM Bossart, Nathan wrote:
>> I would suggest changing "attach from a debugger" to "attaching with a
>> debugger."
>
> Thanks. IMO, the following looks better:
> Waiting on connection startup before authentication to allow
> attaching a debugger to the process.
> Waiting on connection startup after authentication to allow
> attaching a debugger to the process.

Your phrasing looks good to me.

>> IIUC you want to use the same set of flags as PostAuthDelay for
>> PreAuthDelay, but the stated reason in this comment for leaving out
>> WL_LATCH_SET suggests otherwise. It's not clear to me why the latch
>> possibly pointing to a shared latch in the future is an issue. Should
>> this instead say that we leave out WL_LATCH_SET for consistency with
>> PostAuthDelay?
>
> If WL_LATCH_SET is used for PostAuthDelay, the waiting doesn't happen
> because the MyLatch which is a shared latch would be set by
> SwitchToSharedLatch. More details at [1].
> If WL_LATCH_SET is used for PreAuthDelay, actually there's no problem
> because MyLatch is still not initialized properly in BackendInitialize
> when waiting for PreAuthDelay, it still points to local latch, but
> later gets pointed to shared latch and gets set SwitchToSharedLatch.
> But relying on MyLatch there seems to me somewhat relying on an
> uninitialized variable. More details at [1].
>
> For PreAuthDelay, with the comment I wanted to say that the MyLatch is
> not the correct one we would want to wait for. Since there is no
> problem in using it there, I changed the comment to following:
>
>     /*
>      * Let's not use WL_LATCH_SET for PreAuthDelay to be consistent with
>      * PostAuthDelay.
>      */

How about we elaborate a bit? WL_LATCH_SET is not used for consistency with PostAuthDelay. MyLatch isn't fully initialized for the backend at this point, anyway.

    +   /*
    +    * PostAuthDelay will not get applied, if WL_LATCH_SET is used. This
    +    * is because the latch could have been set initially.
    +    */

I would suggest the following: If WL_LATCH_SET is used, PostAuthDelay may not be applied, since the latch might already be set.

Otherwise, this patch looks good and could probably be marked ready-for-committer.

Nathan
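For reference, the WaitLatch() form being discussed would look something like the sketch below. This is only an assumption about the patch's shape: the WAIT_EVENT_PRE_AUTH_DELAY name is inferred from the wait-event descriptions quoted above, and the latch argument can simply be NULL, since WaitLatch() ignores it when WL_LATCH_SET is not requested:

    /* instead of pg_usleep(PreAuthDelay * 1000000L): */
    if (PreAuthDelay > 0)
        (void) WaitLatch(NULL,
                         WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         PreAuthDelay * 1000L,      /* timeout in msec */
                         WAIT_EVENT_PRE_AUTH_DELAY);

Unlike pg_usleep(), this form both reacts to postmaster death and shows up with a proper wait event in pg_stat_activity, which is the point of the patch.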
Re: Skip temporary table schema name from explain-verbose output.
Robert Haas writes:
> On Mon, Jul 26, 2021 at 12:49 PM Tom Lane wrote:
>> I can't really see any situation in which it's important
>> to report the exact schema name of our own temp schema.

> It would actually be nice if there were some easy way of getting that
> for the rare situations in which there are problems.

I experimented with pushing the behavior into get_namespace_name, and immediately ran into problems, for example

--- /home/postgres/pgsql/src/test/regress/expected/jsonb.out	2021-03-01 16:32:13.348655633 -0500
+++ /home/postgres/pgsql/src/test/regress/results/jsonb.out	2021-07-26 13:10:53.523540855 -0400
@@ -320,11 +320,9 @@
 where tablename = 'rows' and
       schemaname = pg_my_temp_schema()::regnamespace::text
 order by 1;
- attname |     histogram_bounds
----------+--------------------------
- x       | [1, 2, 3]
- y       | ["txt1", "txt2", "txt3"]
-(2 rows)
+ attname | histogram_bounds
+---------+------------------
+(0 rows)

-- to_jsonb, timestamps
select to_jsonb(timestamp '2014-05-28 12:22:35.614298');

What's happening here is that regnamespace_out is returning 'pg_temp' which doesn't match any name visible in pg_namespace. So that would pretty clearly break user queries as well as our own tests. I'm afraid that the wholesale behavior change I was imagining isn't going to work. Probably we'd better stick to doing something close to the v2 patch I posted.

I'm still suspicious of that logic in event_trigger.c, though.

regards, tom lane
Re: Have I found an interval arithmetic bug?
On Sun, Jul 25, 2021 at 11:56:54AM -0700, Bryn Llewellyn wrote:
> As far as I’ve been able, the PG documentation doesn’t do a good job of
> defining the semantics of any of these operations. Some (like the “justify”

This is because fractional interval values are not used or asked about often.

> functions” are sketched reasonably well. Others, like interval multiplication,
> are entirely undefined.

Yes, the “justify” functions were requested and implemented because they met a frequently-requested need unrelated to fractional values, though they do have spill-up uses.

> This makes discussion of simple test like the two I showed immediately above
> hard. It also makes any discussion of correctness, possible bugs, and proposed
> implementation changes very difficult.

Agreed. With fractional values an edge use-case, we are trying to find the most useful implementation.

> As I’ve said, my conclusion is that the only safe approach is to create and use
> only “pure” interval values (where just one of the internal fields is
> non-zero). For this reason (and having seen what I decided was the impossibly
> unmemorable rules that my modeled implementation uses) I didn’t look at the
> rules for the other fields that the interval literal allows (weeks, centuries,
> millennia, and so on).

I think the current page is clear about _specifying_ fractional units, but you are right that multiplication/division of fractional values is not covered.

-- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
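For the record, what interval multiplication actually does today with fractional results is to spill the fraction down into the next-smaller field, using 30 days per month and 24 hours per day (results as observed on a current server; this is exactly the behavior the documentation leaves unspecified):

    SELECT interval '1 month' * 1.5;        -- 1 mon 15 days
    SELECT interval '1 year' * 1.5;         -- 1 year 6 mons
    SELECT interval '1 month 1 day' * 1.5;  -- 1 mon 16 days 12:00:00

The third case shows the compounding: the half month becomes 15 days, which combines with the 1.5 days, and the leftover half day spills into the time field.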
Re: Rename of triggers for partitioned tables
Alvaro Herrera writes: > Arne complained that there should be a unique constraint on (tgrelid, > tgparentid) which would sidestep the need for this to be a loop. I > don't think it's really necessary, and I'm not sure how to create a > system index WHERE tgparentid <> 0. Yeah, we don't support partial indexes on catalogs, and this example doesn't make me feel like we ought to open that can of worms. But then maybe this function shouldn't assume there's only one match? regards, tom lane
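A sketch of what the test-and-elog shape suggested upthread could look like, if a duplicate is worth detecting rather than assuming away (an illustrative fragment only; the variable names and message text are made up, not the actual trigger.c code):

    /* There should be at most one matching tuple */
    tuple = systable_getnext(tgscan);
    if (HeapTupleIsValid(tuple))
    {
        /* copy it out, since advancing the scan invalidates the tuple */
        tuple = heap_copytuple(tuple);

        /* a second match would mean corrupted catalogs */
        if (HeapTupleIsValid(systable_getnext(tgscan)))
            elog(ERROR, "unexpected duplicate pg_trigger row for relation %u",
                 tgrelid);
    }

Since the function is not performance critical, the extra systable_getnext() call costs essentially nothing, and a real ERROR gives a usable report if the catalogs ever do end up corrupted.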
Re: Skip temporary table schema name from explain-verbose output.
On 2021-Jul-26, Tom Lane wrote: > On the other hand, I don't like 0002 one bit, because it's not accounting > for whether the temp schema it's mangling is *our own* temp schema or some > other session's. I do not think it is wise or even safe to report some > other temp schema as being "pg_temp". By the same token, I wonder whether > this bit in event_trigger.c is a good idea or a safety hazard: > > /* XXX not quite get_namespace_name_or_temp */ > if (isAnyTempNamespace(schema_oid)) > schema = pstrdup("pg_temp"); > else > schema = get_namespace_name(schema_oid); Oh, you meant this one. To be honest I don't remember *why* this code wants to show remote temp tables as just "pg_temp" ... it's possible that some test in the DDL-to-JSON code depended on this behavior. Without spending too much time analyzing it, I agree that it seems dangerous and might lead to referring to unintended objects. (Really, my memory is not clear on *why* we would be referring to temp tables of other sessions.) -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ "No necesitamos banderas No reconocemos fronteras" (Jorge González)
Re: Fix around conn_duration in pgbench
On 2021/06/30 14:43, Yugo NAGATA wrote:
> On Wed, 30 Jun 2021 14:35:37 +0900 Yugo NAGATA wrote:
>> Hello Asif,
>>
>> On Tue, 29 Jun 2021 13:21:54 + Asif Rehman wrote:
>>> The following review has been posted through the commitfest application:
>>> make installcheck-world: tested, passed
>>> Implements feature: tested, passed
>>> Spec compliant: tested, passed
>>> Documentation: not tested
>>>
>>> This patch looks fine to me. master 48cb244fb9
>>>
>>> The new status of this patch is: Ready for Committer
>>
>> Thank you for reviewing this patch!
>>
>> The previous patch included fixes about connection failures, but this
>> part was moved to another patch discussed in [1].
>>
>> [1] https://www.postgresql.org/message-id/alpine.DEB.2.22.394.2106181535400.3146194%40pseudo
>>
>> I attached the updated patch.
>
> I am sorry but I attached the other patch. Attached in this post is the
> latest patch.

    case CSTATE_FINISHED:
    +   /* per-thread last disconnection time is not measured */

Could you tell me why we don't need to do this measurement?

    -   /* no connection delay to record */
    -   thread->conn_duration = 0;
    +   /* connection delay is measured globally between the barriers */

Is this comment really correct? I was thinking that the measurement is not necessary here because this is the case where the -C option is not specified.

    done:
        start = pg_time_now();
        disconnect_all(state, nstate);
        thread->conn_duration += pg_time_now() - start;

Shouldn't we measure the disconnection time here only when the -C option is specified (i.e., the is_connect variable is true)? Though, I'm not sure how much this change would help reduce the performance overhead.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
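If we go that way, the "done:" block could end up looking something like this (a sketch only, built on the quoted code and pgbench's existing is_connect flag and pg_time_now() helper):

    done:
        if (is_connect)
        {
            pg_time_usec_t start = pg_time_now();

            /* only -C mode folds disconnection time into conn_duration */
            disconnect_all(state, nstate);
            thread->conn_duration += pg_time_now() - start;
        }
        else
            disconnect_all(state, nstate);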
Why don't update minimum recovery point in xact_redo_abort
Hi, all

Recently, I got a PANIC while restarting a standby, which can be reproduced by the following steps, based on pg 11:

1. begin a transaction in the primary node;
2. create a table in the transaction;
3. insert lots of data into the table;
4. do a checkpoint, and restart the standby after the checkpoint is done in the primary node;
5. insert/update lots of data into the table again;
6. abort the transaction.

After step 6, fast shutdown the standby node and then restart it; you will get a PANIC log, and the backtrace is:

#0  0x7fc663e5a277 in raise () from /lib64/libc.so.6
#1  0x7fc663e5b968 in abort () from /lib64/libc.so.6
#2  0x00c89f01 in errfinish (dummy=0) at elog.c:707
#3  0x00c8cba3 in elog_finish (elevel=22, fmt=0xdccc18 "WAL contains references to invalid pages") at elog.c:1658
#4  0x005e476a in XLogCheckInvalidPages () at xlogutils.c:253
#5  0x005cbc1a in CheckRecoveryConsistency () at xlog.c:9477
#6  0x005ca5c5 in StartupXLOG () at xlog.c:8609
#7  0x00a025a5 in StartupProcessMain () at startup.c:274
#8  0x00643a5c in AuxiliaryProcessMain (argc=2, argv=0x7ffe4e4849a0) at bootstrap.c:485
#9  0x00a00620 in StartChildProcess (type=StartupProcess) at postmaster.c:6215
#10 0x009f92c6 in PostmasterMain (argc=3, argv=0x4126500) at postmaster.c:1506
#11 0x008eab64 in main (argc=3, argv=0x4126500) at main.c:232

I think the reason for the above error is as follows:

1. the transaction in the primary node was finally aborted, and the standby node also deleted the table files after replaying the xlog record, but without updating the minimum recovery point;
2. the primary node did a checkpoint before the abort, and the standby node was then restarted, so the standby will recover from a point where the table has already been created and data has been inserted into it;
3. when the standby node restarts after step 6, it will find that a page needed during recovery doesn't exist, having already been deleted by xact_redo_abort, so it treats this page as an invalid page;
4. xact_redo_abort drops relation files without updating the minimum recovery point, so before the standby replays the abort xlog record and forgets the invalid pages again, it will already claim to have reached consistency, because the abort record's LSN is greater than minRecoveryPoint;
5. during CheckRecoveryConsistency, it checks the invalid pages, finds that one exists, and the PANIC log is generated.

So why don't we update the minimum recovery point in xact_redo_abort, just like the XLogFlush call in xact_redo_commit? That way the standby would only reach consistency and check invalid pages after it has replayed the abort record.

Hope to get your reply.

Thanks & Best Regards
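A sketch of the kind of fix being asked about here, mirroring the XLogFlush() call in xact_redo_commit(). This is only a guess at the shape; in particular, the record's end LSN would have to be passed down into xact_redo_abort(), which today doesn't receive it:

    if (parsed->nrels > 0)
    {
        /*
         * Make sure files supposed to be dropped are dropped only after
         * the minimum recovery point has been advanced past this record,
         * as xact_redo_commit() already does for commits.
         */
        XLogFlush(lsn);

        DropRelationFiles(parsed->xnodes, parsed->nrels, true);
    }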
Re: Rename of triggers for partitioned tables
From: Tom Lane
Sent: Monday, July 26, 2021 19:38
To: Alvaro Herrera
Subject: Re: Rename of triggers for partitioned tables

> Yeah, we don't support partial indexes on catalogs, and this example
> doesn't make me feel like we ought to open that can of worms.

I asked why such an index doesn't exist already. I guess that answers it. I definitely don't want to push for supporting that, especially not knowing what it entails.

> But then maybe this function shouldn't assume there's only one match?

I'd consider a duplication here a serious bug. That is pretty much catalog corruption. Having two children within a single child table sounds like it would break a lot of things. That being said, this function is hardly performance critical. I don't know whether throwing an elog indicating what's going wrong here would be worth it.

Regards

Arne
Re: Skip temporary table schema name from explain-verbose output.
Alvaro Herrera writes: > Oh, you meant this one. To be honest I don't remember *why* this code > wants to show remote temp tables as just "pg_temp" ... it's possible > that some test in the DDL-to-JSON code depended on this behavior. > Without spending too much time analyzing it, I agree that it seems > dangerous and might lead to referring to unintended objects. (Really, > my memory is not clear on *why* we would be referring to temp tables of > other sessions.) Yeah, it's not very clear why that would happen, but if it does, showing "pg_temp" seems pretty misleading. I tried replacing the code with just get_namespace_name_or_temp(), and it still gets through check-world, for whatever that's worth. I'm inclined to change this in HEAD but leave it alone in the back branches. While it seems pretty bogus, it's not clear if anyone out there could be relying on the current behavior. regards, tom lane
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
On Fri, Jul 23, 2021 at 4:57 PM Mark Dilger wrote: > > On Jul 23, 2021, at 1:54 PM, Robert Haas wrote: > > Yeah, but you're inventing a system for allowing the restriction on a > > GUC to be something other than is-superuser in the very patch we're > > talking about. So it could be something like is-database-security. > > Therefore I don't grok the objection. > > I'm not objecting to how hard it would be to implement. I'm objecting to the > semantics. If the only non-superuser who can set the GUC is > pg_database_security, then it is absolutely worthless in preventing > pg_database_security from trapping actions performed by pg_network_security > members. On the other hand, if pg_network_security can also set the GUC, > then pg_network_security can circumvent audit logging that > pg_database_security put in place. What's the point in having these as > separate roles if they can circumvent each other's authority? Right, that would be bad. I had forgotten how this worked, but it seems that event triggers are called with the privileges of the user whose action caused the event trigger to be fired, not the privileges of the user who owns the trigger. So as you say, if you can get somebody to do something that causes an event trigger to be fired, you can do anything they can do. As far as I can see, the only reasonable conclusion is that, unless we change the security model, doing anything with event triggers will have to remain superuser-only. In other words I don't think we can give it to any of pg_database_security or pg_host_security or pg_network_security, or any similar role. We could have a pg_event_triggers role that is documented as able to usurp superuser, but I don't see the point. Now, the other alternative is changing the security model for event triggers, but I am not sure that really fixes anything. You proposed having a new mode where someone could only do things that could be done by either user, but that troubles me for a number of reasons. One is that it often makes a difference who actually did a particular operation. For example it might be that alice and bob both have the ability to give charlie permission on some table, but the ACL for that table will record who actually issued the grant. It might be that both alice and bob have the ability to create a table, but the table will be owned by whoever actually does. Suppose bob is about to be terminated but can arrange for alice (who is a star performer) to grant permissions to his accomplice charlie, thus arranging for those permissions to survive his impending termination. That's bad. Also, what about just throwing an ERROR? Anybody's allowed to do that, but that doesn't mean that it's OK for one user to block everything some other user wants to do. If seward and bates respectively have pg_database_security and pg_network_security, it's OK for seward to interfere with attempts by bates to access database objects, but it's not OK for seward to prevent bates from reconfiguring network access to PostgreSQL. Because event triggers don't fire for ALTER SYSTEM or DDL commands on global objects, we might almost be OK here, but I'm not sure if it's completely OK. I'm pretty sure that the reason we set this up the way we did was because we assumed that the person creating the event trigger would always have maximum privileges i.e. superuser. Therefore, it seemed "safer" to run the code under the less-privileged account. 
If we'd thought about this from the perspective of having non-superuser-owned event triggers, I think we would have made the opposite decision, since running code as yourself in somebody else's session is less dangerous than running code as somebody else straight up. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Removing "long int"-related limit on hash table sizes
On 2021-Jul-25, Ranier Vilela wrote:

> > BTW, one aspect of this that I'm unsure how to tackle is the
> > common usage of "L" constants; in particular, "work_mem * 1024L"
> > is a really common idiom that we'll need to get rid of. Not sure
> > that grep will be a useful aid for finding those.
>
> I can see 30 matches in the head tree. (grep -d "1024L" *.c)

grep '[0-9]L\>' -- *.[chyl]

shows some more constants.

-- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Re: list of extended statistics on psql (\dX)
Hi, I've pushed the last version of the fix, including the regression tests etc. Backpatch to 14, where \dX was introduced. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Some code cleanup for pgbench and pg_verifybackup
On 2021-Jul-26, Fabien COELHO wrote:

> > - pgbench and pg_verifybackup make use of pg_log_fatal() to report
> > some failures in code paths dedicated to command-line options, which
> > is inconsistent with all the other tools that use pg_log_error().
>
> The semantics for fatal vs error is that an error is somehow handled while a
> fatal is not. If the log message is followed by an cold exit, ISTM that
> fatal is the right call, and I cannot help if other commands do not do that.
> ISTM more logical to align other commands to fatal when appropriate.

I was surprised to discover a few weeks ago that pg_log_fatal() did not terminate the program, which was my expectation. If every single call to pg_log_fatal() is followed by exit(1), why not make pg_log_fatal() itself exit? Apparently this coding pattern confuses many people -- for example pg_verifybackup.c lines 291ff fail to exit(1) after "throwing" a fatal error, while the block at line 275 does the right thing.

(In reality we cannot literally just exit(1) in pg_log_fatal(), because there are quite a few places that do some other thing after the log call and before exit(1), or terminate the program in some other way than exit().)

-- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
Essentially, you're proposing Kevlar shoes as a solution for the problem
that you want to walk around carrying a loaded gun aimed at your foot.
(Tom Lane)
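For the simple cases, an exiting variant could be a thin wrapper over the existing src/common logging machinery, something like the sketch below. The wrapper and its name are hypothetical; pg_log_generic_v() and PG_LOG_FATAL are the pieces that already exist:

    #include <stdarg.h>
    #include <stdlib.h>

    #include "common/logging.h"

    /* log at FATAL level, then terminate the program */
    void
    pg_log_fatal_exit(const char *fmt, ...)
    {
        va_list     ap;

        va_start(ap, fmt);
        pg_log_generic_v(PG_LOG_FATAL, fmt, ap);
        va_end(ap);

        exit(1);
    }

The call sites that need cleanup (closing connections, printing the --help hint) before exiting would of course have to keep doing that themselves, per the caveat above.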
Re: needless complexity in StartupXLOG
On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost wrote:
> Yeah, tend to agree with this too ... but something I find a bit curious
> is the comment:
>
>     * Insert a special WAL record to mark the end of
>     * recovery, since we aren't doing a checkpoint.
>
> ... immediately after setting promoted = true, and then at the end of
> StartupXLOG() having:
>
>     if (promoted)
>         RequestCheckpoint(CHECKPOINT_FORCE);
>
> maybe I'm missing something, but seems like that comment isn't being
> terribly clear. Perhaps we aren't doing a full checkpoint *there*, but
> sure looks like we're going to do one moments later regardless of
> anything else since we've set promoted to true..?

Yep. So it's a question of whether we allow operations that might write WAL in the meantime. When we write the checkpoint record right here, there can't be any WAL from the new server lifetime until the checkpoint completes. When we write an end-of-recovery record, there can. And there could actually be quite a bit, because if we do the checkpoint right in this section of code, it will be a fast checkpoint, whereas in the code you quoted above, it's a spread checkpoint, which takes a lot longer. So the question is whether it's reasonable to give the checkpoint some time to complete or whether it needs to be completed right now.

> Agreed that simpler logic is better, provided it's correct logic, of
> course. Finding better ways to test all of this would be really nice.

Yeah, and there again, it's a lot easier to test if we don't have as many cases. Now no single change is going to fix that, but the number of flag variables in xlog.c is simply bonkers. This particular stretch of code uses 3 of them to even decide whether to attempt the test in question, and all of those are set in complex ways depending on the values of still more flag variables. The comments where bgwriterLaunched is set claim that we only do this during archive recovery, not crash recovery, but the code depends on the value of ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we can't get the bgwriter to run even during crash recovery in the scenario described by:

    * It's possible that archive recovery was requested, but we don't
    * know how far we need to replay the WAL before we reach consistency.
    * This can happen for example if a base backup is taken from a
    * running server using an atomic filesystem snapshot, without calling
    * pg_start/stop_backup. Or if you just kill a running primary server
    * and put it into archive recovery by creating a recovery signal
    * file.

If we ran the bgwriter all the time during crash recovery, we'd know for sure whether that causes any problems. If we only do it like this in certain corner cases, it's much more likely that we have bugs. Grumble, grumble.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: [PATCH] proposal for regexp_count, regexp_instr, regexp_substr and regexp_replace
Gilles Darold writes:
> [ v4-0001-regexp-foo-functions.patch ]

I started to work through this and was distressed to realize that it's trying to redefine regexp_replace() in an incompatible way. We already have

    regression=# \df regexp_replace
                                   List of functions
       Schema   |      Name      | Result data type |  Argument data types   | Type
    ------------+----------------+------------------+------------------------+------
     pg_catalog | regexp_replace | text             | text, text, text       | func
     pg_catalog | regexp_replace | text             | text, text, text, text | func
    (2 rows)

The patch proposes to add (among other alternatives)

    +{ oid => '9608', descr => 'replace text using regexp',
    +  proname => 'regexp_replace', prorettype => 'text',
    +  proargtypes => 'text text text int4',
    +  prosrc => 'textregexreplace_extended_no_occurrence' },

which is going to be impossibly confusing for both humans and machines. I don't think we should go there. Even if you managed to construct examples that didn't result in "ambiguous function" failures, that doesn't mean that ordinary mortals won't get bit that way.

I'm inclined to just drop the regexp_replace additions. I don't think that the extra parameters Oracle provides here are especially useful. They're definitely not useful enough to justify creating compatibility hazards for.

regards, tom lane
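As a hypothetical illustration of the hazard (assuming the patch's (text, text, text, int4) variant were installed alongside the existing (text, text, text, text) one), two nearly identical calls would silently resolve to different functions:

    SELECT regexp_replace('banana', 'a', 'o', 2);
    -- would pick the int4 variant: 2 as an Oracle-style start position

    SELECT regexp_replace('banana', 'a', 'o', '2');
    -- picks the existing text variant: '2' taken as a flags string,
    -- which presumably then fails at runtime as an invalid option

The quoted literal is initially of unknown type, and the type-resolution rules would presumably prefer the text candidate for it, so nothing so much as a warning would point at the mismatch.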
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Fri, Jul 23, 2021 at 4:57 PM Mark Dilger > wrote: > > > On Jul 23, 2021, at 1:54 PM, Robert Haas wrote: > > > Yeah, but you're inventing a system for allowing the restriction on a > > > GUC to be something other than is-superuser in the very patch we're > > > talking about. So it could be something like is-database-security. > > > Therefore I don't grok the objection. > > > > I'm not objecting to how hard it would be to implement. I'm objecting to > > the semantics. If the only non-superuser who can set the GUC is > > pg_database_security, then it is absolutely worthless in preventing > > pg_database_security from trapping actions performed by pg_network_security > > members. On the other hand, if pg_network_security can also set the GUC, > > then pg_network_security can circumvent audit logging that > > pg_database_security put in place. What's the point in having these as > > separate roles if they can circumvent each other's authority? > > Right, that would be bad. I had forgotten how this worked, but it > seems that event triggers are called with the privileges of the user > whose action caused the event trigger to be fired, not the privileges > of the user who owns the trigger. So as you say, if you can get > somebody to do something that causes an event trigger to be fired, you > can do anything they can do. As far as I can see, the only reasonable > conclusion is that, unless we change the security model, doing > anything with event triggers will have to remain superuser-only. In > other words I don't think we can give it to any of > pg_database_security or pg_host_security or pg_network_security, or > any similar role. We could have a pg_event_triggers role that is > documented as able to usurp superuser, but I don't see the point. Right- event triggers work just the same as how regular triggers on tables do and how RLS works. All of these also have the possibility of leveraging security definer functions, of course, but that doesn't address the issue of the trigger author attempting to attack the individual running the trigger. I do think it'd be useful to have a pg_event_triggers or such role, so that someone could create them without being a superuser. A bit more discussion about that below though.. > Now, the other alternative is changing the security model for event > triggers, but I am not sure that really fixes anything. You proposed > having a new mode where someone could only do things that could be > done by either user, but that troubles me for a number of reasons. One > is that it often makes a difference who actually did a particular > operation. For example it might be that alice and bob both have the > ability to give charlie permission on some table, but the ACL for that > table will record who actually issued the grant. It might be that both > alice and bob have the ability to create a table, but the table will > be owned by whoever actually does. Suppose bob is about to be > terminated but can arrange for alice (who is a star performer) to > grant permissions to his accomplice charlie, thus arranging for those > permissions to survive his impending termination. That's bad. 
As I understood Mark's suggestion, the trigger would run but would have the privileges of the intersection of both user's permissions, which is an interesting idea but not one we've got any way to really do today as each privilege check would now need to check two different roles for privilege- and if one of the privilege checks fails, then what..? Presumably there would be an ERROR returned, meaning that the operation would be able to be prevented from happening by the trigger author, which was objected to as not being acceptable either, per below. > Also, what about just throwing an ERROR? Anybody's allowed to do that, > but that doesn't mean that it's OK for one user to block everything > some other user wants to do. If seward and bates respectively have > pg_database_security and pg_network_security, it's OK for seward to > interfere with attempts by bates to access database objects, but it's > not OK for seward to prevent bates from reconfiguring network access > to PostgreSQL. Because event triggers don't fire for ALTER SYSTEM or > DDL commands on global objects, we might almost be OK here, but I'm > not sure if it's completely OK. Regular table triggers can be used to block someone from messing with that table, so this isn't entirely unheard of. Deciding that someone with event trigger access is allowed to prevent certain things from happening in a database that they're allowed to connect and create event triggers in may not be completely unreasonable. > I'm pretty sure that the reason we set this up the way we did was > because we assumed that the person creating the event trigger would > always have maximum privileges i.e. superuser. Therefore, it seemed > "safer" to run the code under the less-privileged account. If
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
On Mon, Jul 26, 2021 at 4:05 PM Stephen Frost wrote:
> As I understood Mark's suggestion, the trigger would run but would have
> the privileges of the intersection of both user's permissions, which is
> an interesting idea but not one we've got any way to really do today as
> each privilege check would now need to check two different roles for
> privilege- and if one of the privilege checks fails, then what..?
> Presumably there would be an ERROR returned, meaning that the operation
> would be able to be prevented from happening by the trigger author,
> which was objected to as not being acceptable either, per below.

I think I may not have expressed myself clearly enough here. What I'm concerned about is: Alice should not be permitted to prevent Bob from doing something which Bob is allowed to do and Alice is not allowed to do. If Alice is the administrator of PostgreSQL's XYZ subsystem, she can permit Bob from using it if she wishes. But if Bob is an administrator of XYZ and Alice is not, there shouldn't be a way for Alice to obstruct Bob's access to that system.

Do you agree?

-- Robert Haas EDB: http://www.enterprisedb.com
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
On Mon, Jul 26, 2021 at 4:12 PM Robert Haas wrote:
> I think I may not have expressed myself clearly enough here. What I'm
> concerned about is: Alice should not be permitted to prevent Bob
> from doing something which Bob is allowed to do and Alice is not
> allowed to do. If Alice is the administrator of PostgreSQL's XYZ
> subsystem, she can permit Bob from using it if she wishes. But if Bob

argh, typo. I meant prevent, not permit.

> is an administrator of XYZ and Alice is not, there shouldn't be a way
> for Alice to obstruct Bob's access to that system.

-- Robert Haas EDB: http://www.enterprisedb.com
Re: needless complexity in StartupXLOG
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost wrote: > > Yeah, tend to agree with this too ... but something I find a bit curious > > is the comment: > > > > * Insert a special WAL record to mark the end of > > * recovery, since we aren't doing a checkpoint. > > > > ... immediately after setting promoted = true, and then at the end of > > StartupXLOG() having: > > > > if (promoted) > > RequestCheckpoint(CHECKPOINT_FORCE); > > > > maybe I'm missing something, but seems like that comment isn't being > > terribly clear. Perhaps we aren't doing a full checkpoint *there*, but > > sure looks like we're going to do one moments later regardless of > > anything else since we've set promoted to true..? > > Yep. So it's a question of whether we allow operations that might > write WAL in the meantime. When we write the checkpoint record right > here, there can't be any WAL from the new server lifetime until the > checkpoint completes. When we write an end-of-recovery record, there > can. And there could actually be quite a bit, because if we do the > checkpoint right in this section of code, it will be a fast > checkpoint, whereas in the code you quoted above, it's a spread > checkpoint, which takes a lot longer. So the question is whether it's > reasonable to give the checkpoint some time to complete or whether it > needs to be completed right now. All I was really trying to point out above was that the comment might be improved upon, just so someone understands that we aren't doing a checkpoint at this particular place, but one will be done later due to the promotion. Maybe I'm being a bit extra with that, but that was my thought when reading the code and the use of the promoted flag variable. > > Agreed that simpler logic is better, provided it's correct logic, of > > course. Finding better ways to test all of this would be really nice. > > Yeah, and there again, it's a lot easier to test if we don't have as > many cases. Now no single change is going to fix that, but the number > of flag variables in xlog.c is simply bonkers. This particular stretch > of code uses 3 of them to even decide whether to attempt the test in > question, and all of those are set in complex ways depending on the > values of still more flag variables. The comments where > bgwriterLaunched is set claim that we only do this during archive > recovery, not crash recovery, but the code depends on the value of > ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we > can't get the bgwriter to run even during crash recovery in the > scenario described by: > > * It's possible that archive recovery was requested, but we don't > * know how far we need to replay the WAL before we reach consistency. > * This can happen for example if a base backup is taken from a > * running server using an atomic filesystem snapshot, without calling > * pg_start/stop_backup. Or if you just kill a running primary server > * and put it into archive recovery by creating a recovery signal > * file. > > If we ran the bgwriter all the time during crash recovery, we'd know > for sure whether that causes any problems. If we only do it like this > in certain corner cases, it's much more likely that we have bugs. > Grumble, grumble. Yeah ... not to mention that it really is just incredibly dangerous to use such an approach for PITR. 
For my 2c, I'd rather we figure out a way to prevent this than to imply that we support it when we have no way of knowing if we actually have replayed far enough to be consistent. That isn't to say that using snapshots for database backups isn't possible, but it should be done in-between pg_start/stop_backup calls which properly grab the returned info from those and store the backup label with the snapshot, etc. Thanks, Stephen signature.asc Description: PGP signature
Re: Removing "long int"-related limit on hash table sizes
Alvaro Herrera writes: > On 2021-Jul-25, Ranier Vilela wrote: > >> > BTW, one aspect of this that I'm unsure how to tackle is the >> > common usage of "L" constants; in particular, "work_mem * 1024L" >> > is a really common idiom that we'll need to get rid of. Not sure >> > that grep will be a useful aid for finding those. >> > >> I can see 30 matches in the head tree. (grep -d "1024L" *.c) > > grep grep '[0-9]L\>' -- *.[chyl] > shows some more constants. git grep -Eiw '(0x[0-9a-f]+|[0-9]+)U?LL?' -- *.[chyl] gives about a hundred more hits. We also have the (U)INT64CONST() macros, which are about about two thirds as common as the U?LL? suffixes. - ilmari
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Stephen Frost writes: > As I understood Mark's suggestion, the trigger would run but would have > the privileges of the intersection of both user's permissions, which is > an interesting idea but not one we've got any way to really do today as > each privilege check would now need to check two different roles for > privilege- and if one of the privilege checks fails, then what..? > Presumably there would be an ERROR returned, meaning that the operation > would be able to be prevented from happening by the trigger author, > which was objected to as not being acceptable either, per below. I've not been paying close attention, so maybe this was already considered, but ... What if we allow event triggers owned by non-superusers, but only fire them on commands performed by the trigger's owner? This sidesteps all the issues of who has which privileges and whether Alice is malicious towards Bob or vice versa, because there is no change of privilege domain. Admittedly, it fails to cover some use-cases, but I think it would still handle a lot of interesting cases. The impression I have is that a lot of applications do everything under just one or a few roles. Possibly this could be generalized to "fire on commands performed by any role the trigger owner is a member of", but then I'm a bit less sure that it's safe from both roles' perspectives. regards, tom lane
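Sketching the proposed rule in SQL terms (entirely hypothetical, since CREATE EVENT TRIGGER is currently superuser-only, and alice_audit_func is a made-up function name):

    -- alice, a non-superuser, creates and therefore owns the trigger
    SET ROLE alice;
    CREATE EVENT TRIGGER alice_audit ON ddl_command_end
        EXECUTE FUNCTION alice_audit_func();

    -- fires for alice's own DDL ...
    CREATE TABLE alice_t (x int);

    -- ... but under the proposed rule would not fire here:
    SET ROLE bob;
    CREATE TABLE bob_t (x int);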
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
> On Jul 26, 2021, at 1:12 PM, Robert Haas wrote: > > Alice should not be permitted to preventing Bob > from doing something which Bob is allowed to do and Alice is not > allowed to do. That sounds intuitively reasonable, though it depends on what "which Bob is allowed to do" means. For instance, if Alice is only allowed to enable or disable connections to the database, and she disables them, then she has prevented Bob from, for example, creating tables, something which Bob is otherwise allowed to do, because without the ability to connect, he cannot create tables. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Removing "long int"-related limit on hash table sizes
ilm...@ilmari.org (Dagfinn Ilmari Mannsåker) writes: > We also have the (U)INT64CONST() macros, which are about two > thirds as common as the U?LL? suffixes. Yeah. Ideally we'd forbid direct use of the suffixes and insist you go through those macros, but I don't know of any way that we could enforce such a coding rule, short of grepping the tree periodically. regards, tom lane
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
On 7/24/21, 8:10 PM, "Fujii Masao" wrote: > On 2021/07/25 7:50, Andres Freund wrote: >> Hi, >> >> I've been repeatedly confused by the number of WAL files supposedly >> added. Even when 100s of new WAL files are created the relevant portion >> of log_checkpoints will only ever list zero or one added WAL file. >> >> The reason for that is that CheckpointStats.ckpt_segs_added is only >> incremented in PreallocXlogFiles(). Which has the following comment: >> * XXX this is currently extremely conservative, since it forces only one >> * future log segment to exist, and even that only if we are 75% done with >> * the current one. This is only appropriate for very low-WAL-volume >> systems. >> >> Whereas in real workloads WAL files are almost exclusively created via >> XLogWrite()->XLogFileInit(). >> >> I think we should consider just removing that field. Or, even better, show >> something accurate instead. > > +1 to show something accurate instead. > > It's also worth showing them in monitoring stats view like pg_stat_wal? +1. I was confused by this when working on a WAL pre-allocation patch [0]. Perhaps it could be replaced by a new parameter and a new field in pg_stat_wal. How about something like log_wal_init_interval, where the value is the minimum amount of time between reporting the number of WAL segments created since the last report? Nathan [0] https://www.postgresql.org/message-id/flat/20201225200953.jjkrytlrzojbn...@alap3.anarazel.de
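For reference, a sketch of the "accurate" option (hypothetical placement; an actual patch may well look different): since segments are almost exclusively created via XLogWrite()->XLogFileInit(), the counter could be bumped there instead.

/* in XLogFileInit(), once a brand-new segment file has been created
 * (as opposed to recycling an old one): */
CheckpointStats.ckpt_segs_added++;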
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Greetings, * Robert Haas (robertmh...@gmail.com) wrote: > On Mon, Jul 26, 2021 at 4:12 PM Robert Haas wrote: > > I think I may not have expressed myself clearly enough here. What I'm > > concerned about is: Alice should not be permitted to prevent Bob > > from doing something which Bob is allowed to do and Alice is not > > allowed to do. If Alice is the administrator of PostgreSQL's XYZ > > subsystem, she can permit Bob from using it if she wishes. But if Bob > > argh, typo. I meant prevent, not permit. > > > is an administrator of XYZ and Alice is not, there shouldn't be a way > > for Alice to obstruct Bob's access to that system. > Do you agree? So ... yes and no. There's an awful lot being ascribed to 'administrator' without any definition of it actually being given. We are working in this thread to explicitly split up superuser privileges to allow them to be granted to non-superusers and talking about cases where those privileges end up interacting with each other. Is Alice, as the 'network' manager, considered an 'administrator' of XYZ? Is Bob, as the 'database' manager, considered an 'administrator'? Perhaps both are, perhaps neither are. It doesn't seem helpful to be vague. If Alice is given the right to create event triggers in a given database, then that's explicitly giving Alice the right to block anyone from dropping tables in that database, because that's an inherent part of the event trigger system. Should superusers be able to bypass that? Yes, they probably should be able to and, ideally, they'd be able to do that just in a particular session. Should a user who has been allowed to modify certain GUCs that perhaps Alice hasn't been allowed to modify be able to be prevented from modifying those GUCs by Alice, when neither is a superuser? That's definitely a trickier question and I don't know that I've got an answer offhand. Thanks, Stephen signature.asc Description: PGP signature
Re: Removing "long int"-related limit on hash table sizes
On 2021-Jul-26, Tom Lane wrote: > ilm...@ilmari.org (Dagfinn Ilmari =?utf-8?Q?Manns=C3=A5ker?=) writes: > > We also have the (U)INT64CONST() macros, which are about about two > > thirds as common as the U?LL? suffixes. > > Yeah. Ideally we'd forbid direct use of the suffixes and insist > you go through those macros, but I don't know of any way that > we could enforce such a coding rule, short of grepping the tree > periodically. IIRC we have one buildfarm member that warns us about perlcritic; maybe this is just another setup of that sort. (Personally I run the perlcritic check in my local commit-verifying script before pushing.) -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "XML!" Exclaimed C++. "What are you doing here? You're not a programming language." "Tell that to the people who use me," said XML. https://burningbird.net/the-parable-of-the-languages/
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
On 2021-Jul-26, Tom Lane wrote: > What if we allow event triggers owned by non-superusers, but only fire > them on commands performed by the trigger's owner? This sidesteps all > the issues of who has which privileges and whether Alice is malicious > towards Bob or vice versa, because there is no change of privilege > domain. Admittedly, it fails to cover some use-cases, but I think it > would still handle a lot of interesting cases. The impression I have > is that a lot of applications do everything under just one or a few > roles. This is similar to, but not quite the same as, an idea I had: have event triggers owned by non-superusers run for all non-superusers, but not for superusers. It is still the case that all non-superusers have to trust everyone with the event-trigger-create permission, but that's probably the database owner, so most of the time you have to trust them already. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Take out the book your religion considers the right one to find the prayer that brings peace to your soul. Then reboot the computer and see if it works." (Carlos Duclós)
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Greetings, * Alvaro Herrera (alvhe...@alvh.no-ip.org) wrote: > On 2021-Jul-26, Tom Lane wrote: > > > What if we allow event triggers owned by non-superusers, but only fire > > them on commands performed by the trigger's owner? This sidesteps all > > the issues of who has which privileges and whether Alice is malicious > > towards Bob or vice versa, because there is no change of privilege > > domain. Admittedly, it fails to cover some use-cases, but I think it > > would still handle a lot of interesting cases. The impression I have > > is that a lot of applications do everything under just one or a few > > roles. > > This is similar to, but not quite the same as, an idea I had: have event triggers owned > by non-superusers run for all non-superusers, but not for superusers. > It is still the case that all non-superusers have to trust everyone with > the event-trigger-create permission, but that's probably the database > owner so most of the time you have to trust them already. This sort of logic is what has caused issues with CREATEROLE, imv. It's simply not as simple as "don't run this when the superuser flag is set" because non-superuser roles can become superusers. We need something better to have something like this actually be safe. Tom's suggestion would work, of course, but it would mean having to create event triggers for all the roles in the system, and would those roles who own those event triggers be able to disable them..? If so, it would almost certainly defeat the point of an auditing event trigger.. Thanks, Stephen signature.asc Description: PGP signature
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
I wrote: > Possibly this could be generalized to "fire on commands performed by > any role the trigger owner is a member of", but then I'm a bit less > sure that it's safe from both roles' perspectives. After further thought, I can't poke a hole in that concept. We'd keep the rule that the trigger executes as the calling user. Therefore, the trigger cannot perform any action that the calling user couldn't do if she chose. Conversely, since the trigger owner could become a member of that role and then do whatever the trigger intends to do, this scheme does not give the trigger owner any new abilities either. All we've done is provide what some programming languages call an observer or annotation. I also like the fact that with this rule, superusers' ability to create event triggers that fire for everything is not a special case. regards, tom lane
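A sketch of that firing rule in C (hypothetical; evtowner stands for the event trigger's owner OID, and is_member_of_role() is the existing membership test in acl.c):

/* in the loop that decides which event triggers to fire */
if (!is_member_of_role(evtowner, GetUserId()))
	continue;			/* owner is not a member of the invoking role */

Since is_member_of_role() treats superusers as members of every role, a superuser-owned trigger firing for everything falls out of the same test, matching the last point above.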
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Stephen Frost writes: > ... Tom's suggestion > would work, of course, but it would mean having to create event triggers > for all the roles in the system, and would those roles who own those > event triggers be able to disable them..? Uh, why not? If you own the trigger, you can drop it, so why shouldn't you be able to temporarily disable it? > If so, it would almost > certainly be against the point of an auditing event trigger.. If you want auditing capability, you make an auditor role that is a member of every other role, and then it owns the trigger. (If you need to audit superuser actions too, then the auditor has to be a superuser itself, but that's no worse than before; and I'd argue that non-superusers shouldn't be able to audit superusers anyway.) regards, tom lane
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
On 2021-Jul-26, Tom Lane wrote: > Stephen Frost writes: > > ... Tom's suggestion > > would work, of course, but it would mean having to create event triggers > > for all the roles in the system, and would those roles who own those > > event triggers be able to disable them..? > > Uh, why not? If you own the trigger, you can drop it, so why shouldn't > you be able to temporarily disable it? I think an auditing system that can be turned off by the audited user is pretty much useless. Or did I misunderstand what you are suggesting? -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Creating is as difficult as being free." (Elsa Triolet)
Re: automatically generating node support functions
Peter Eisentraut writes: >> The first eight patches are to clean up various inconsistencies to make >> parsing or generation easier. > Are there any concerns about the patches 0001 through 0008? Otherwise, > maybe we could get those out of the way. I looked through those and don't have any complaints (though I just eyeballed them, I didn't see what a compiler would say). I see you pushed a couple of them already. regards, tom lane
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
Alvaro Herrera writes: > On 2021-Jul-26, Tom Lane wrote: >> Uh, why not? If you own the trigger, you can drop it, so why shouldn't >> you be able to temporarily disable it? > I think an auditing system that can be turned off by the audited user is > pretty much useless. Or did I misunderstood what you are suggesting? For auditing purposes, you make a trusted role that owns the trigger, and is a member of the roles whose actions are to be audited (but NOT vice versa). I think that any idea that the auditing role doesn't need to be trusted that much is foolhardy. What we can buy here is not requiring the auditing role to be full superuser ... assuming that you don't need auditing of superusers. regards, tom lane
Re: Delegating superuser tasks to new security roles (Was: Granting control of SUSET gucs to non-superusers)
On Mon, Jul 26, 2021 at 4:28 PM Stephen Frost wrote: > so ... yes and no. There's an awful lot being ascribed to > 'administrator' without any definition of it being actually given. We > are working in this thread to explicitly split up superuser privileges > to allow them to be granted to non-superusers and talking about cases > where those privileges end up interacting with each other. Is Alice, as > the 'network' manager considered an 'administrator' of XYZ? Is Bob, as > the 'database' manager considered an 'administrator'? Perhaps both are, > perhaps neither are. It doesn't seem helpful to be vague. XYZ was intended to stand in for something like 'network' or 'database' or whatever other particular part of PostgreSQL Alice might be charged with administering. > If Alice is given the right to create event triggers in a given > database, then that's explicitly giving Alice the right to block anyone > from dropping tables in that database because that's an inherent part of > the event trigger system. Should superusers be able to bypass that? > Yes, they probably should be able to and, ideally, they'd be able to do > that just in a particular session. I agree. > Should a user who has been allowed > to modify certain GUCs that perhaps Alice hasn't been allowed to modify > be able to be prevented from modifying those GUCs by Alice, when neither > is a superuser? That's definitely a trickier question and I don't know > that I've got an answer offhand. My answer would be "no". I concede Mark's point in another email that if Alice can entirely prevent Bob from connecting to the database then by inference she can also prevent him from exercising any other privileges he may have. I'm prepared to say that's OK; if Alice is administering network connections to the database and cuts everyone else off, then I guess that's just how it is. But if Bob does somehow succeed in getting a connection to the database, then he should be able to exercise his right to change those GUCs which he has permission to change. Alice shouldn't be able to thwart that. -- Robert Haas EDB: http://www.enterprisedb.com
pg_settings.pending_restart not set when line removed
Hi,

I got a report from Gabriele Bartolini and team that the pg_settings view does not get the pending_restart flag set when a setting's line is removed from a file (as opposed to its value changed). The explanation seems to be that GUC_PENDING_RESTART is set by set_config_option, but when ProcessConfigFileInternal is called only to provide data (applySettings=false), then set_config_option is never called and thus the flag doesn't get set. I tried the attached patch, which sets GUC_PENDING_RESTART if we're doing pg_file_settings(). Then any subsequent read of pg_settings will have the pending_restart flag set. This seems to work correctly, and consistently with the case where you change a line (without removing it) in unpatched master.

You could argue that this is *weird* (why does reading pg_file_settings set a flag in global state?) ... but that weirdness is not something this patch is introducing.

-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "There are two moments in a man's life when he should not speculate: when he can afford it and when he can't." (Mark Twain)

>From 71fa384a6bf7aef58bdbe6d382cbc1219dbaa420 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Mon, 26 Jul 2021 18:35:09 -0400
Subject: [PATCH] fix guc

---
 src/backend/utils/misc/guc-file.l | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/backend/utils/misc/guc-file.l b/src/backend/utils/misc/guc-file.l
index 986ce542e3..1fe3af6284 100644
--- a/src/backend/utils/misc/guc-file.l
+++ b/src/backend/utils/misc/guc-file.l
@@ -354,6 +354,7 @@ ProcessConfigFileInternal(GucContext context, bool applySettings, int elevel)
 											 gconf->name),
 							  NULL, 0,
 							  &head, &tail);
+			gconf->status |= GUC_PENDING_RESTART;
 			error = true;
 			continue;
 		}
-- 
2.20.1
Re: Added schema level support for publication.
On Mon, Jul 26, 2021 at 3:21 PM vignesh C wrote: > > Thanks for the comment, this is modified in the v15 patch attached. > I have several minor review comments. (1) src/backend/catalog/objectaddress.c Should start comment sentences with an uppercase letter, for consistency. + /* fetch publication name and schema oid from input list */ I also notice that some 1-sentence comments end with "." (full-stop) and others don't. It seems to alternate all over the place, and so is quite noticeable. Unfortunately, it already seems to be like this in much of the code that this patch patches. Ideally (at least my personal preference is) 1-sentence comments should not end with a ".". (2) src/backend/catalog/pg_publication.c errdetail message I think the following should say "Temporary schemas ..." (since the existing error message for tables says "System tables cannot be added to publications."). + errdetail("Temporary schema cannot be added to publications."))); (3) src/backend/commands/publicationcmds.c PublicationAddTables I think that the Assert below is not correct (i.e. not restrictive enough). Although the condition passes, it is allowing, for example, stmt->for_all_tables==true if stmt->schemas==NIL, and that doesn't seem to be correct. I suggest the following change: BEFORE: + Assert(!stmt || !stmt->for_all_tables || !stmt->schemas); AFTER: + Assert(!stmt || (!stmt->for_all_tables && !stmt->schemas)); (4) src/backend/commands/publicationcmds.c PublicationAddSchemas Similarly, I think that the Assert below is not restrictive enough, and think it should be changed: BEFORE: + Assert(!stmt || !stmt->for_all_tables || !stmt->tables); AFTER: + Assert(!stmt || (!stmt->for_all_tables && !stmt->tables)); (5) src/bin/pg_dump/common.c Spelling mistake. BEFORE: + pg_log_info("reading publciation schemas"); AFTER: + pg_log_info("reading publication schemas"); Regards, Greg Nancarrow Fujitsu Australia
Slim down integer formatting
Folks,

Please find attached a patch to do $subject. It's down to one table lookup and 3 instructions.

In covering the int64 versions, I swiped a lightweight division from the Ryu stuff. I'm pretty sure that what I did is not how to do #includes, but it's a PoC. What would be a better way to do this?

Best, David.
-- David Fetter http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate

>From 1cf7202facd9fee161865d90304e5ede1e3c65cf Mon Sep 17 00:00:00 2001
From: David Fetter
Date: Mon, 26 Jul 2021 16:43:43 -0700
Subject: [PATCH v1] PoC: Slim down integer printing some more

---
 src/backend/utils/adt/numutils.c | 62 +++-
 1 file changed, 29 insertions(+), 33 deletions(-)

diff --git src/backend/utils/adt/numutils.c src/backend/utils/adt/numutils.c
index b93096f288..c4f2d5cfb1 100644
--- src/backend/utils/adt/numutils.c
+++ src/backend/utils/adt/numutils.c
@@ -18,6 +18,7 @@
 #include <limits.h>
 #include <ctype.h>
 
+#include "../common/d2s_intrinsics.h"
 #include "common/int.h"
 #include "utils/builtins.h"
 #include "port/pg_bitutils.h"
@@ -39,50 +40,45 @@ static const char DIGIT_TABLE[200] =
 "90" "91" "92" "93" "94" "95" "96" "97" "98" "99";
 
 /*
- * Adapted from http://graphics.stanford.edu/~seander/bithacks.html#IntegerLog10
+ * Adapted from
+ * https://lemire.me/blog/2021/06/03/computing-the-number-of-digits-of-an-integer-even-faster/
  */
 static inline int
 decimalLength32(const uint32 v)
 {
-	int			t;
-	static const uint32 PowersOfTen[] = {
-		1, 10, 100,
-		1000, 10000, 100000,
-		1000000, 10000000, 100000000,
-		1000000000
+	/*
+	 * Each element in the array, when added to a number with MSB that is the
+	 * array index, will produce a number whose top 32 bits are its decimal
+	 * length, so that can now be had using:
+	 *
+	 * 1 CLZ instruction,
+	 * 1 table lookup,
+	 * 1 add, and
+	 * 1 shift
+	 *
+	 */
+	static const uint64 digit_transitions[32] = {
+		4294967295, 8589934582, 8589934582, 8589934582, 12884901788,
+		12884901788, 12884901788, 17179868184, 17179868184, 17179868184,
+		21474826480, 21474826480, 21474826480, 21474826480, 25769703776,
+		25769703776, 25769703776, 30063771072, 30063771072, 30063771072,
+		34349738368, 34349738368, 34349738368, 34349738368, 38554705664,
+		38554705664, 38554705664, 41949672960, 41949672960, 41949672960,
+		42949672960, 42949672960
 	};
 
-	/*
-	 * Compute base-10 logarithm by dividing the base-2 logarithm by a
-	 * good-enough approximation of the base-2 logarithm of 10
-	 */
-	t = (pg_leftmost_one_pos32(v) + 1) * 1233 / 4096;
-	return t + (v >= PowersOfTen[t]);
+	return (v + digit_transitions[pg_leftmost_one_pos32(v)]) >> 32;
 }
 
 static inline int
 decimalLength64(const uint64 v)
 {
-	int			t;
-	static const uint64 PowersOfTen[] = {
-		UINT64CONST(1), UINT64CONST(10),
-		UINT64CONST(100), UINT64CONST(1000),
-		UINT64CONST(10000), UINT64CONST(100000),
-		UINT64CONST(1000000), UINT64CONST(10000000),
-		UINT64CONST(100000000), UINT64CONST(1000000000),
-		UINT64CONST(10000000000), UINT64CONST(100000000000),
-		UINT64CONST(1000000000000), UINT64CONST(10000000000000),
-		UINT64CONST(100000000000000), UINT64CONST(1000000000000000),
-		UINT64CONST(10000000000000000), UINT64CONST(100000000000000000),
-		UINT64CONST(1000000000000000000), UINT64CONST(10000000000000000000)
-	};
-
-	/*
-	 * Compute base-10 logarithm by dividing the base-2 logarithm by a
-	 * good-enough approximation of the base-2 logarithm of 10
-	 */
-	t = (pg_leftmost_one_pos64(v) + 1) * 1233 / 4096;
-	return t + (v >= PowersOfTen[t]);
+	if (v >> 32 == 0)
+		return decimalLength32(v);
+	else if (div1e8(v) >> 32 == 0)
+		return 8 + decimalLength32(div1e8(v));
+	else
+		return 16 + decimalLength32(div1e8(div1e8(v)));
 }
 
 /*
-- 
2.31.1
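As a worked example of the table trick (using the digit_transitions[] values above):

/*
 * For v = 999, pg_leftmost_one_pos32(999) = 9 and
 * digit_transitions[9] = 17179868184 = 4 * 2^32 - 1000, so
 *
 *	 (999 + 17179868184) >> 32 = (4 * 2^32 - 1) >> 32 = 3
 *	(1000 + 17179868184) >> 32 = (4 * 2^32)     >> 32 = 4
 *
 * i.e. the addition carries into the next multiple of 2^32 exactly when
 * v gains a decimal digit.
 */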
Re: Some code cleanup for pgbench and pg_verifybackup
On Mon, Jul 26, 2021 at 03:35:29PM -0400, Alvaro Herrera wrote: > On 2021-Jul-26, Fabien COELHO wrote: >> The semantics for fatal vs error is that an error is somehow handled while a >> fatal is not. If the log message is followed by a cold exit, ISTM that >> fatal is the right call, and I cannot help if other commands do not do that. >> ISTM more logical to align other commands to fatal when appropriate. I disagree. pgbench is an outlier here. There are 71 calls to pg_log_fatal() in src/bin/, and pgbench counts for 54 of them. It would be more consistent to align pgbench with the others. > I was surprised to discover a few weeks ago that pg_log_fatal() did not > terminate the program, which was my expectation. If every single call > to pg_log_fatal() is followed by exit(1), why not make pg_log_fatal() > itself exit? Apparently this coding pattern confuses many people -- for > example pg_verifybackup.c lines 291ff fail to exit(1) after "throwing" a > fatal error, while the block at lines 275 does the right thing. I remember having the same reaction when those logging APIs got introduced (I may be wrong here), and I also recall that this point has been discussed, where the conclusion was that the logging should never exit() directly. > (In reality we cannot literally just exit(1) in pg_log_fatal(), because > there are quite a few places that do some other thing after the log > call and before exit(1), or terminate the program in some other way than > exit().) Yes, that's obviously wrong. I am wondering if there is more of that. -- Michael signature.asc Description: PGP signature
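To illustrate the semantics being debated, a hypothetical wrapper (no such helper exists in src/common/logging.c; pg_log_generic_v() does, and is the variadic entry point used below):

#include <stdarg.h>
#include <stdlib.h>

#include "common/logging.h"

static void
log_fatal_and_exit(const char *fmt,...)
{
	va_list		ap;

	va_start(ap, fmt);
	pg_log_generic_v(PG_LOG_FATAL, fmt, ap);
	va_end(ap);
	exit(1);				/* callers would need no trailing exit(1) */
}

As noted above, this could not simply replace pg_log_fatal(), because some callers do other work between the log call and exit(1), or terminate in some way other than exit().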
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
On 2021/07/27 5:27, Bossart, Nathan wrote: > +1. I was confused by this when working on a WAL pre-allocation > patch [0]. Perhaps it could be replaced by a new parameter and a new > field in pg_stat_wal. How about something like log_wal_init_interval, > where the value is the minimum amount of time between reporting the > number of WAL segments created since the last report? You mean to introduce a new GUC like log_wal_init_interval, so that the number of WAL files created since the last report is logged at that interval? But isn't it better and simpler to just expose the accumulated number of WAL files created, in the pg_stat_wal view or elsewhere? If so, we can easily get to know the number of WAL files created in each interval by checking the view and calculating the diff. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Re: list of extended statistics on psql (\dX)
Hi Tomas and Justin, On 2021/07/27 4:26, Tomas Vondra wrote: > Hi, > > I've pushed the last version of the fix, including the regression tests > etc. Backpatch to 14, where \dX was introduced. Thank you! Regards, Tatsuro Yamada
Re: needless complexity in StartupXLOG
On Mon, Jul 26, 2021 at 03:53:20PM -0400, Robert Haas wrote: > Yeah, and there again, it's a lot easier to test if we don't have as > many cases. Now no single change is going to fix that, but the number > of flag variables in xlog.c is simply bonkers. This particular stretch > of code uses 3 of them to even decide whether to attempt the test in > question, and all of those are set in complex ways depending on the > values of still more flag variables. The comments where > bgwriterLaunched is set claim that we only do this during archive > recovery, not crash recovery, but the code depends on the value of > ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we > can't get the bgwriter to run even during crash recovery in the > scenario described by: I'm not following along closely and maybe you're already aware of this one? https://commitfest.postgresql.org/33/2706/ Background writer and checkpointer in crash recovery @Thomas: https://www.postgresql.org/message-id/CA%2BTgmoYmw%3D%3DTOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA%40mail.gmail.com > * It's possible that archive recovery was requested, but we don't > * know how far we need to replay the WAL before we reach consistency. > * This can happen for example if a base backup is taken from a > * running server using an atomic filesystem snapshot, without calling > * pg_start/stop_backup. Or if you just kill a running primary server > * and put it into archive recovery by creating a recovery signal > * file. > > If we ran the bgwriter all the time during crash recovery, we'd know > for sure whether that causes any problems. If we only do it like this > in certain corner cases, it's much more likely that we have bugs. > Grumble, grumble. -- Justin
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
On 7/26/21, 5:23 PM, "Fujii Masao" wrote: > On 2021/07/27 5:27, Bossart, Nathan wrote: >> +1. I was confused by this when working on a WAL pre-allocation >> patch [0]. Perhaps it could be replaced by a new parameter and a new >> field in pg_stat_wal. How about something like log_wal_init_interval, >> where the value is the minimum amount of time between reporting the >> number of WAL segments created since the last report? > > You mean to introduce new GUC like log_wal_init_interval and that > the number of WAL files created since the last report will be logged > every that interval? But isn't it better and simpler to just expose > the accumulated number of WAL files created, in pg_stat_wal view > or elsewhere? If so, we can easily get to know the number of WAL files > created in every interval by checking the view and calculating the diff. I agree with you about adding a new field to pg_stat_wal. The parameter would just be a convenient way of logging this information for future reference. I don't feel strongly about the parameter if you think the pg_stat_wal addition is enough. Nathan
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
Hi, On 2021-07-26 20:27:21 +0000, Bossart, Nathan wrote: > +1. I was confused by this when working on a WAL pre-allocation > patch [0]. Perhaps it could be replaced by a new parameter and a new > field in pg_stat_wal. How about something like log_wal_init_interval, > where the value is the minimum amount of time between reporting the > number of WAL segments created since the last report? Why not just make the number in log_checkpoints accurate? There's no point in the current number, so we don't need to preserve it... Greetings, Andres Freund
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
Hi, On 2021-07-25 12:10:07 +0900, Fujii Masao wrote: > It's also worth showing them in monitoring stats view like pg_stat_wal? I'm not convinced that's all that meaningful. It makes sense to include it as part of the checkpoint output, because checkpoints determine when WAL can be recycled etc. It's not that clear to me how to represent that as part of pg_stat_wal? I guess we could add columns for the amount of WAL that has been a) newly created b) recycled c) removed. In combination that *does* seem useful. But also a mostly independent change... Greetings, Andres Freund
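A sketch of the counters being suggested (hypothetical field names, to live in the struct behind pg_stat_wal):

PgStat_Counter wal_segs_created;	/* brand-new segment files */
PgStat_Counter wal_segs_recycled;	/* old segments renamed for reuse */
PgStat_Counter wal_segs_removed;	/* segments unlinked outright */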
RE: row filtering for logical replication
On July 23, 2021 6:16 PM Amit Kapila wrote: > On Fri, Jul 23, 2021 at 2:27 PM Rahila Syed wrote: > > > > The column comparison for row filtering happens before the unchanged > > toast columns are filtered. Unchanged toast columns are filtered just > > before writing the tuple to output stream. > > > > To perform filtering, you need to use the tuple from WAL and that tuple > doesn't > seem to have unchanged toast values, so how can we do filtering? I think it > is a > good idea to test this once. I agree. Currently, neither unchanged toasted key columns nor unchanged toasted non-key columns are logged. So, we cannot get the toasted value directly for these columns when doing row filtering. I tested the current patch for toasted data and found a problem: the current patch will try to fetch the toast data from the toast table when doing row filtering[1]. But it's unsafe to do that in the walsender. We can see that it uses a HISTORIC snapshot in heap_fetch_toast_slice(), and the comments of init_toast_snapshot() also say "Detoasting *must* happen in the same transaction that originally fetched the toast pointer.". The toast data could have been changed by the time we do the row filtering. For example, I tested the following steps and got an error. 1) UPDATE a nonkey column in the publisher. 2) Use a debugger to block the walsender process in function pgoutput_row_filter_exec_expr(). 3) Open another psql to connect to the publisher, and drop the table which was updated in 1). 4) Unblock the debugger in 2), and then I can see the following error: --- ERROR: could not read block 0 in file "base/13675/16391" --- [1] (1)--publisher-- CREATE TABLE toasted_key ( id serial, toasted_key text PRIMARY KEY, toasted_col1 text, toasted_col2 text ); select repeat('99', 200) as tvalue \gset CREATE PUBLICATION pub FOR TABLE toasted_key WHERE (toasted_col2 = :'tvalue'); ALTER TABLE toasted_key REPLICA IDENTITY USING INDEX toasted_key_pkey; ALTER TABLE toasted_key ALTER COLUMN toasted_key SET STORAGE EXTERNAL; ALTER TABLE toasted_key ALTER COLUMN toasted_col1 SET STORAGE EXTERNAL; ALTER TABLE toasted_key ALTER COLUMN toasted_col2 SET STORAGE EXTERNAL; INSERT INTO toasted_key(toasted_key, toasted_col1, toasted_col2) VALUES(repeat('1234567890', 200), repeat('9876543210', 200), repeat('99', 200)); (2)--subscriber-- CREATE TABLE toasted_key ( id serial, toasted_key text PRIMARY KEY, toasted_col1 text, toasted_col2 text ); CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres port=10000' PUBLICATION pub; (3)--publisher-- UPDATE toasted_key SET toasted_col1 = repeat('113113', 200); Based on the above steps, the row filter will go through the following path and fetch toast data in the walsender. -- pgoutput_row_filter_exec_expr ... texteq ... text *targ1 = DatumGetTextPP(arg1); pg_detoast_datum_packed detoast_attr --
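A hypothetical guard, sketched only to illustrate the hazard (the patch under review has no such check; isnull and value are stand-in names): an on-disk external datum in a tuple taken from WAL means the value itself was never logged, so evaluating the filter on it chases a toast pointer under the walsender's historic snapshot, which is exactly what blows up in the repro above.

if (!isnull && VARATT_IS_EXTERNAL_ONDISK(DatumGetPointer(value)))
	ereport(ERROR,
			(errmsg("cannot evaluate row filter on an unchanged toasted column")));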
Re: ORDER BY pushdowns seem broken in postgres_fdw
On Thu, 22 Jul 2021 at 20:49, Ronan Dunklau wrote: > > On Thursday, 22 July 2021, 09:44:54 CEST, David Rowley wrote: > > Can you also use explain (verbose, costs off) the same as the other > > tests in that area. Having the costs there would never survive a run > > of the buildfarm. Different hardware will produce different costs, e.g. > > 32-bit hardware might produce cheaper costs due to narrower widths. > > > > Sorry about that. Here it is.

I had a look over the v5 patch and noticed a few issues and a few things that could be improved.

This is not OK:

+tuple = SearchSysCache4(AMOPSTRATEGY,
+						ObjectIdGetDatum(pathkey->pk_opfamily),
+						em->em_datatype,
+						em->em_datatype,
+						pathkey->pk_strategy);

SearchSysCache* expects Datum inputs, so you must use the *GetDatum() macro for each input that isn't already a Datum.

I also: 1. Changed the error message for when that lookup fails so that it's the same as the others that perform a lookup with AMOPSTRATEGY. 2. Put back the comment in equivclass.c for find_em_expr_for_rel. I saw no reason that comment should be changed when the function does exactly what it did before. 3. Renamed appendOrderbyUsingClause to appendOrderBySuffix. I wasn't happy that the name indicated it was only handling USING clauses when it also handled ASC/DESC. I also moved the NULLS FIRST/LAST stuff in there. 4. Adjusted is_foreign_pathkey() to make it easier to read and do is_shippable() before calling find_em_expr_for_rel(). I didn't see the need to call find_em_expr_for_rel() when is_shippable() returned false. 5. Adjusted find_em_expr_for_rel() to remove the ternary operator.

I've attached what I ended up with. It seems that it was the following commit that introduced the ability for sorts to be pushed down to the foreign server, so it would be good if the authors of that patch could look over this.

commit f18c944b6137329ac4a6b2dce5745c5dc21a8578 Author: Robert Haas Date: Tue Nov 3 12:46:06 2015 -0500 postgres_fdw: Add ORDER BY to some remote SQL queries.

David v6_fix_postgresfdw_orderby_handling.patch Description: Binary data
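For clarity, the corrected shape of the SearchSysCache4() call criticized above would presumably be (pk_strategy is an int16, hence Int16GetDatum()):

tuple = SearchSysCache4(AMOPSTRATEGY,
						ObjectIdGetDatum(pathkey->pk_opfamily),
						ObjectIdGetDatum(em->em_datatype),
						ObjectIdGetDatum(em->em_datatype),
						Int16GetDatum(pathkey->pk_strategy));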
Re: shared-memory based stats collector
Hi, On 2021-07-26 17:52:01 +0900, Kyotaro Horiguchi wrote: > > > Yeah, thank you very much for checking that. However, this patch is > > > now developed in Andres' GitHub repository. So I'm at a loss what to > > > do for the failure.. > > > > I'll post a rebased version soon. > > (Sorry if you feel hurried, which I didn't mean to.) No worries! I had intended to post a rebase by now. But while I did mostly finish that (see [1]) I unfortunately encountered a new issue around partitioned tables, see [2]. Currently I'm hoping for a few thoughts on that thread about the best way to address the issues. Greetings, Andres Freund [1] https://github.com/anarazel/postgres/tree/shmstat [2] https://www.postgresql.org/message-id/20210722205458.f2bug3z6qzxzpx2s%40alap3.anarazel.de
Re: Incorrect usage of strtol, atoi for non-numeric junk inputs
On Mon, Jul 26, 2021 at 05:46:22PM +0900, Kyotaro Horiguchi wrote: > Thanks for revising and committing! I'm fine with all of the recent > discussions on the committed part. Though I don't think it works for > "live" command line options, making the omitting policy symmetric > looks good. I see the same done in several similar uses of strto[il]. OK, applied this one. So for now we are done here. > The change in 020_pg_receivewal.pl results in a chain of four > successive failures, which is fine. But the last failure (#23) happens > for a somewhat dubious reason. Yes, I saw that as well. I'll address that separately. -- Michael signature.asc Description: PGP signature
Re: Fix around conn_duration in pgbench
Hello Fujii-san,

Thank you for looking at it.

On Tue, 27 Jul 2021 03:04:35 +0900 Fujii Masao wrote:

> case CSTATE_FINISHED:
> + /* per-thread last disconnection time is not measured */
>
> Could you tell me why we don't need to do this measurement?

We don't need to do it because it is already done in the CSTATE_END_TX state when the transaction successfully finished. Also, we don't need it when the thread is aborted (that is, in the CSTATE_ABORTED case) because we can't report complete results anyway in such cases. I updated the comment.

> - /* no connection delay to record */
> - thread->conn_duration = 0;
> + /* connection delay is measured globally between the barriers */
>
> This comment is really correct? I was thinking that the measurement is not
> necessary here because this is the case where -C option is not specified.

This comment means that, when -C is not specified, the connection delay is measured between the barrier point where the benchmark starts

 /* READY */
 THREAD_BARRIER_WAIT(&barrier);

and the barrier point where all the threads finish making their initial connections.

 /* GO */
 THREAD_BARRIER_WAIT(&barrier);

> done:
> start = pg_time_now();
> disconnect_all(state, nstate);
> thread->conn_duration += pg_time_now() - start;
>
> We should measure the disconnection time here only when -C option specified
> (i.e., is_connect variable is true)? Though, I'm not sure how much this
> change is helpful to reduce the performance overhead

You are right. We are measuring the disconnection time only when the -C option is specified, but it is already done at the end of the transaction (i.e., CSTATE_END_TX). We need to disconnect here only when we get an error. Therefore, we don't need the measurement here.

I attached the updated patch.

Regards,
Yugo Nagata
-- Yugo NAGATA

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 364b5a2e47..ffd74db3ad 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -3545,8 +3545,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 
 			if (is_connect)
 			{
+				pg_time_usec_t start = now;
+
+				pg_time_now_lazy(&start);
 				finishCon(st);
-				now = 0;
+				now = pg_time_now();
+				thread->conn_duration += now - start;
 			}
 
 			if ((st->cnt >= nxacts && duration <= 0) || timer_exceeded)
@@ -3570,6 +3574,12 @@ advanceConnectionState(TState *thread, CState *st, StatsData *agg)
 			 */
 		case CSTATE_ABORTED:
 		case CSTATE_FINISHED:
+			/*
+			 * Per-thread last disconnection time is not measured because it
+			 * is already done when the transaction successfully finished.
+			 * Also, we don't need it when the thread is aborted because we
+			 * can't report complete results anyway in such cases.
+			 */
 			finishCon(st);
 			return;
 	}
@@ -6615,6 +6625,7 @@ threadRun(void *arg)
 	thread_start = pg_time_now();
 	thread->started_time = thread_start;
+	thread->conn_duration = 0;
 	last_report = thread_start;
 	next_report = last_report + (int64) 1000000 * progress;
@@ -6638,14 +6649,7 @@ threadRun(void *arg)
 				goto done;
 			}
 		}
-
-		/* compute connection delay */
-		thread->conn_duration = pg_time_now() - thread->started_time;
-	}
-	else
-	{
-		/* no connection delay to record */
-		thread->conn_duration = 0;
+		/* connection delay is measured globally between the barriers */
 	}
 
 	/* GO */
@@ -6846,9 +6850,7 @@ threadRun(void *arg)
 	}
 
 done:
-	start = pg_time_now();
 	disconnect_all(state, nstate);
-	thread->conn_duration += pg_time_now() - start;
 
 	if (thread->logfile)
 	{
a thinko in b676ac443b6
Hi, I noticed $subject while rebasing my patch at [1] to enable batching for the inserts used in cross-partition UPDATEs. b676ac443b6 did this: - resultRelInfo->ri_PlanSlots[resultRelInfo->ri_NumSlots] = - MakeSingleTupleTableSlot(planSlot->tts_tupleDescriptor, -planSlot->tts_ops); ... + { + TupleDesc tdesc = CreateTupleDescCopy(slot->tts_tupleDescriptor); + + resultRelInfo->ri_Slots[resultRelInfo->ri_NumSlots] = + MakeSingleTupleTableSlot(tdesc, slot->tts_ops); ... + resultRelInfo->ri_PlanSlots[resultRelInfo->ri_NumSlots] = + MakeSingleTupleTableSlot(tdesc, planSlot->tts_ops); I think it can be incorrect to use the same TupleDesc for both the slots in ri_Slots (for ready-to-be-inserted tuples) and ri_PlanSlots (for subplan output tuples). Especially if you consider what we did in 86dc90056df that was committed into v14. In that commit, we changed the way a subplan under ModifyTable produces its output for an UPDATE statement. Previously, it would produce a tuple matching the target table's TupleDesc exactly (plus any junk columns), but now it produces only a partial tuple containing the values for the changed columns. So it's better to revert to using planSlot->tts_tupleDescriptor for the slots in ri_PlanSlots. Attached a patch to do so. -- Amit Langote EDB: http://www.enterprisedb.com [1] https://commitfest.postgresql.org/33/2992/ b676ac443b6-thinko-fix.patch Description: Binary data
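Concretely, the fix presumably restores the pre-b676ac443b6 form for the plan slots, matching the removed code shown above:

resultRelInfo->ri_PlanSlots[resultRelInfo->ri_NumSlots] =
	MakeSingleTupleTableSlot(planSlot->tts_tupleDescriptor,
							 planSlot->tts_ops);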
Re: Slim down integer formatting
On Tue, Jul 27, 2021 at 9:51 AM David Fetter wrote: > > In covering the int64 versions, I swiped a light weight division from > the Ryu stuff. I'm pretty sure that what I did is not how to do > #includes, but it's a PoC. What would be a better way to do this? > That patch didn't apply for me (on latest source) so I've attached an equivalent with those changes, that does apply, and also tweaks the Makefile include path to address that #include issue. Regards, Greg Nancarrow Fujitsu Australia v2-0001-PoC-Slim-down-integer-printing-some-more.patch Description: Binary data
Re: Allow batched insert during cross-partition updates
On Thu, Jul 22, 2021 at 2:18 PM vignesh C wrote: > On Fri, Jul 2, 2021 at 7:35 AM Amit Langote wrote: > > On Fri, Jul 2, 2021 at 1:39 AM Tom Lane wrote: > > > > > > Amit Langote writes: > > > > [ v6-0001-Allow-batching-of-inserts-during-cross-partition-.patch ] > > > > > > Per the cfbot, this isn't applying anymore, so I'm setting it back > > > to Waiting on Author. > > > > Rebased patch attached. Thanks for the reminder. > > One of the test postgres_fdw has failed, can you please post an > updated patch for the fix: > test postgres_fdw ... FAILED (test process exited with > exit code 2) 7264 ms Thanks Vignesh. I found a problem with the underlying batching code that caused this failure and have just reported it here: https://www.postgresql.org/message-id/CA%2BHiwqEWd5B0-e-RvixGGUrNvGkjH2s4m95%3DJcwUnyV%3Df0rAKQ%40mail.gmail.com Here's v8, including the patch for the above problem. -- Amit Langote EDB: http://www.enterprisedb.com v8-0002-Allow-batching-of-inserts-during-cross-partition-.patch Description: Binary data v8-0001-Fix-a-thinko-in-b676ac443b6.patch Description: Binary data
Re: [UNVERIFIED SENDER] Re: pg_upgrade can result in early wraparound on databases with high transaction load
This patch has been applied back to 9.6 and will appear in the next minor release. --- On Tue, May 18, 2021 at 01:26:38PM +0200, Drouvot, Bertrand wrote: > Hi, > > On 5/4/21 10:17 AM, Drouvot, Bertrand wrote: > > > Hi, > > On 4/24/21 3:00 AM, Andres Freund wrote: > > Hi, > > On 2021-04-23 19:28:27 -0500, Justin Pryzby wrote: > > This (combination of) thread(s) seems relevant. > > Subject: pg_upgrade failing for 200+ million Large Objects > > https://www.postgresql.org/message-id/flat/12601596dbbc4c01b86b4ac4d2bd4d48%40EX13D05UWC001.ant.amazon.com > > https://www.postgresql.org/message-id/flat/a9f9376f1c3343a6bb319dce294e20ac%40EX13D05UWC001.ant.amazon.com > > https://www.postgresql.org/message-id/flat/cc089cc3-fc43-9904-fdba-d830d8222145%40enterprisedb.com#3eec85391c6076a4913e96a86fece75e > > Huh. Thanks for digging these up. > > > > Allows the user to provide a constant via pg_upgrade > command-line, that > overrides the 2 billion constant in pg_resetxlog [1] thereby increasing the > (window of) Transaction IDs available for pg_upgrade to > complete. > > That seems entirely the wrong approach to me, buying further into > the broken idea of inventing random wrong values for oldestXid. > > We drive important things like the emergency xid limits off > oldestXid. On > databases with tables that are older than ~147 million xids (i.e. not > even > affected by the default autovacuum_freeze_max_age) the current > constant leads > to setting the oldestXid to a value *in the future*/wrapped around. > Any > different constant (or pg_upgrade parameter) will do that > too in > other scenarios. > > As far as I can tell there is precisely *no* correct behaviour here > other than > exactly copying the oldestXid limit from the source database. > > > Please find attached a patch proposal doing so: it adds a new (-u) > parameter to pg_resetwal that allows to specify the oldest unfrozen XID to > set. > Then this new parameter is being used in pg_upgrade to copy the source > Latest checkpoint's oldestXID. > > Questions: > > □ Should we keep the old behavior in case -x is being used without -u? > (The proposed patch does not set an arbitrary oldestXID anymore in > case > -x is used.) > □ Also shouldn't we ensure that the xid provided with -x or -u is >= > FirstNormalTransactionId (Currently the only check is that it is != 0)? > > > Copy/pasting Andres feedback (Thanks Andres for this feedback) on those > questions from another thread [1]. > > > I was also wondering if: > > > > * We should keep the old behavior in case pg_resetwal -x is being used > > without -u? (The proposed patch does not set an arbitrary oldestXID > > anymore in case -x is used) > > Andres: I don't think we should. I don't see anything in the old behaviour > maintaining. > > > * We should ensure that the xid provided with -x or -u is > > >= FirstNormalTransactionId (Currently the only check is that it is > > != 0)? > > Andres: Applying TransactionIdIsNormal() seems like a good idea. > > => I am attaching a new version that makes use of TransactionIdIsNormal() > checks. > > Andres: I think it's important to verify that the xid provided with -x is > within a reasonable range of the oldest xid. > > => What do you mean by "a reasonable range"?
> > Thanks > > Bertrand > > [1]: https://www.postgresql.org/message-id/ > 20210517185646.pwe4klaufwmdhe2a%40alap3.anarazel.de > > > > > src/bin/pg_resetwal/pg_resetwal.c | 65 > ++- > src/bin/pg_upgrade/controldata.c | 17 +- > src/bin/pg_upgrade/pg_upgrade.c | 6 > src/bin/pg_upgrade/pg_upgrade.h | 1 + > 4 files changed, 60 insertions(+), 29 deletions(-) > diff --git a/src/bin/pg_resetwal/pg_resetwal.c > b/src/bin/pg_resetwal/pg_resetwal.c > index 805dafef07..5e864760ed 100644 > --- a/src/bin/pg_resetwal/pg_resetwal.c > +++ b/src/bin/pg_resetwal/pg_resetwal.c > @@ -65,6 +65,7 @@ static bool guessed = false;/* T if we had to guess > at any values */ > static const char *progname; > static uint32 set_xid_epoch = (uint32) -1; > static TransactionId set_xid = 0; > +static TransactionId set_oldest_unfrozen_xid = 0; > static TransactionId set_oldest_commit_ts_xid = 0; > static TransactionId set_newest_commit_ts_xid = 0; > static Oid set_oid = 0; > @@ -102,6 +103,7 @@ main(int argc, char *argv[]) > {"next-oid", required_argument, NULL, 'o'}, > {"multixact-offset", required_argument, NULL, 'O'}, > {"next-transaction-id", required_argument, NULL, 'x'}, > + {"oldest-transaction-id"
Re: visibility map corruption
On Sat, Jul 24, 2021 at 10:01:05AM -0400, Bruce Momjian wrote: > On Fri, Jul 23, 2021 at 09:01:18PM -0400, Bruce Momjian wrote: > > On Fri, Jul 23, 2021 at 05:47:18PM -0700, Peter Geoghegan wrote: > > > > I could perhaps see corruption happening if pg_control's oldest xid > > > > value was closer to the current xid value than it should be, but I can't > > > > see how having it 2-billion away could cause harm, unless perhaps > > > > pg_upgrade itself used enough xids to cause the counter to wrap more > > > > than 2^31 away from the oldest xid recorded in pg_control. > > > > > > > > What I am basically asking is how to document this and what it fixes. > > > > > > ISTM that this is a little like commits 78db307bb2 and a61daa14. Maybe > > > take a look at those? > > > > Agreed. I just wanted to make sure I wasn't missing an important aspect > > of this patch. Thanks. > > Another question --- with the previous code, the oldest xid was always > set to a reasonable value, -2 billion less than the current xid. With > the new code, the oldest xid might be slightly higher than the current > xid if they use -x but not -u. Is that acceptable? I think we agreed it > was. pg_upgrade will always set both. This patch has been applied back to 9.6 and will appear in the next minor release. -- Bruce Momjian https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: Slim down integer formatting
On Tue, Jul 27, 2021 at 12:28:22PM +1000, Greg Nancarrow wrote: > That patch didn't apply for me (on latest source) so I've attached an > equivalent with those changes, that does apply, and also tweaks the > Makefile include path to address that #include issue. When applying some micro-benchmarking to stress those APIs, how much does this change things? At the end of the day, this also comes down to an evaluation of pg_ulltoa_n() and pg_ultoa_n(). #include "common/int.h" +#include "d2s_intrinsics.h" Er, are you sure about this part? The first version of the patch did that in a different, also incorrect, way. -- Michael signature.asc Description: PGP signature
Re: Slim down integer formatting
On Tue, 27 Jul 2021 at 14:42, Michael Paquier wrote: > When applying some micro-benchmarking to stress those APIs, how much > does this change things? At the end of the day, this also comes down > to an evaluation of pg_ulltoa_n() and pg_ultoa_n(). I'd suggest something like creating a table with, say, 1 million INTs and testing the performance of COPY to '/dev/null'. Repeat for BIGINT. David
Re: Slim down integer formatting
On Tue, Jul 27, 2021 at 12:42 PM Michael Paquier wrote: > > #include "common/int.h" > +#include "d2s_intrinsics.h" > Er, are you sure about this part? The first version of the patch did > that in a different, also incorrect, way. Er, I was just trying to help out, so at least the patch could be applied (whether the patch has merit is a different story). Are you saying that it's incorrect to include that header file in this source, or that's the wrong way to do it? (i.e. it's wrong to adjust the makefile include path to pick up the location where that header is located and use #include ""? That header is in src/common, which is not on the default include path). The method I used certainly works, but you have objections? Can you clarify what you're saying is incorrect? Regards, Greg Nancarrow Fujitsu Australia
Re: log_checkpoint's "WAL file(s) added" is misleading to the point of uselessness
On 7/26/21, 5:48 PM, "Andres Freund" wrote: > On 2021-07-26 20:27:21 +0000, Bossart, Nathan wrote: >> +1. I was confused by this when working on a WAL pre-allocation >> patch [0]. Perhaps it could be replaced by a new parameter and a new >> field in pg_stat_wal. How about something like log_wal_init_interval, >> where the value is the minimum amount of time between reporting the >> number of WAL segments created since the last report? > > Why not just make the number in log_checkpoints accurate? There's no > point in the current number, so we don't need to preserve it... My understanding is that the statistics logged for log_checkpoints are just for that specific checkpoint. From that angle, the value for the number of WAL files added is technically correct. Checkpoints will only ever create zero or one new file via PreallocXlogFiles(). If we also added all the segments created outside of the checkpoint, the value for "added" would go from meaning "WAL files created by this checkpoint" to "WAL files created since the last checkpoint." That's probably less confusing than what's there today, but it's still slightly different from all the other log_checkpoints stats. Nathan
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jul 26, 2021 at 11:01 PM Masahiko Sawada wrote:
>
> I'll experiment with the proposed ideas including this idea in more
> scenarios and share the results tomorrow.
>

I've done some benchmarks for the proposed data structures. In this trial, I've added scenarios where dead tuples are concentrated in a particular range of table blocks (tests 5-8), in addition to the scenarios from the previous trial. Also, I've run the benchmarks of each scenario while increasing the table size: in the first test, the maximum block number of the table is 1,000,000 (i.e., an 8GB table), and in the second test it's 10,000,000 (an 80GB table). We can see how performance and memory consumption change with a large-scale table. Here are the results:

* Test 1

select prepare(
1000000, -- max block
10,      -- # of dead tuples per page
1,       -- dead tuples interval within a page
1,       -- # of consecutive pages having dead tuples
20       -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 57.23 MB  |  0.040 |   98.613 | 572.21 MB  |      0.387 |     1521.981
 intset | 46.88 MB  |  0.114 |   75.944 | 468.67 MB  |      0.961 |      997.760
 radix  | 40.26 MB  |  0.102 |   18.427 | 336.64 MB  |      0.797 |      266.146
 rtbm   | 64.02 MB  |  0.234 |   22.443 | 512.02 MB  |      2.230 |      275.143
 svtm   | 27.28 MB  |  0.060 |   13.568 | 274.07 MB  |      0.476 |      211.073
 tbm    | 96.01 MB  |  0.273 |   10.347 | 768.01 MB  |      2.882 |      128.103

* Test 2

select prepare(
1000000, -- max block
10,      -- # of dead tuples per page
1,       -- dead tuples interval within a page
1,       -- # of consecutive pages having dead tuples
1        -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 57.23 MB  |  0.041 |    4.757 | 572.21 MB  |      0.344 |       71.228
 intset | 46.88 MB  |  0.127 |    3.762 | 468.67 MB  |      1.093 |       49.573
 radix  | 9.95 MB   |  0.048 |    0.679 | 82.57 MB   |      0.371 |       16.211
 rtbm   | 34.02 MB  |  0.179 |    0.534 | 288.02 MB  |      2.092 |        8.693
 svtm   | 5.78 MB   |  0.043 |    0.239 | 54.60 MB   |      0.342 |        7.759
 tbm    | 96.01 MB  |  0.274 |    0.521 | 768.01 MB  |      2.685 |        6.360

* Test 3

select prepare(
1000000, -- max block
2,       -- # of dead tuples per page
100,     -- dead tuples interval within a page
1,       -- # of consecutive pages having dead tuples
1        -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 11.45 MB  |  0.009 |   57.698 | 114.45 MB  |      0.076 |     1045.639
 intset | 15.63 MB  |  0.031 |   46.083 | 156.23 MB  |      0.243 |      848.525
 radix  | 40.26 MB  |  0.063 |   13.755 | 336.64 MB  |      0.501 |      223.413
 rtbm   | 36.02 MB  |  0.123 |   11.527 | 320.02 MB  |      1.843 |      180.977
 svtm   | 9.28 MB   |  0.053 |    9.631 | 92.59 MB   |      0.438 |      212.626
 tbm    | 96.01 MB  |  0.228 |   10.381 | 768.01 MB  |      2.258 |      126.630

* Test 4

select prepare(
1000000, -- max block
100,     -- # of dead tuples per page
1,       -- dead tuples interval within a page
1,       -- # of consecutive pages having dead tuples
1        -- page interval
);

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 572.21 MB |  0.367 |   78.047 | 5722.05 MB |      3.942 |     1154.776
 intset | 93.74 MB  |  0.777 |   45.146 | 937.34 MB  |      7.716 |      643.708
 radix  | 40.26 MB  |  0.203 |    9.015 | 336.64 MB  |      1.775 |      133.294
 rtbm   | 36.02 MB  |  0.369 |    5.639 | 320.02 MB  |      3.823 |       88.832
 svtm   | 7.28 MB   |  0.294 |    3.891 | 73.60 MB   |      2.690 |      103.744
 tbm    | 96.01 MB  |  0.534 |    5.223 | 768.01 MB  |      5.679 |       60.632

* Test 5

select prepare(
1000000, -- max block
150,     -- # of dead tuples per page
1,       -- dead tuples interval within a page
1,       -- # of consecutive pages having dead tuples
2        -- page interval
);

There is 1 consecutive page that has 150 dead tuples at every 2 pages.

  name  |   size    | attach | shuffled |  size_x10  | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+------------+--------------
 array  | 429.16 MB |  0.274 |   75.664 | 4291.54 MB |      3.067 |     1259.501
 intset | 46.88 MB  |  0.559 |   36.449 | 468.67 MB  |      4.565 |      517.445
 radix  | 20.26 MB  |  0.166 |    8.466 | 196.90 MB  |      1.273 |      166.587
 rtbm   | 18.02 MB  |  0.242 |    8.491 | 160.02 MB  |      2.407 |      171.725
 svtm   | 3.66 MB   |  0.243 |    3.635 | 37.10 MB   |      2.022 |       86.165
 tbm    | 48.01 MB  |  0.344 |    9.763 | 384.01 MB  |      3.327 |      151.824

* Test 6

select pre
Re: row filtering for logical replication
On Tue, Jul 27, 2021 at 6:21 AM houzj.f...@fujitsu.com wrote: > 1) UPDATE a nonkey column in publisher. > 2) Use debugger to block the walsender process in function >pgoutput_row_filter_exec_expr(). > 3) Open another psql to connect the publisher, and drop the table which updated >in 1). > 4) Unblock the debugger in 2), and then I can see the following error: > --- > ERROR: could not read block 0 in file "base/13675/16391" Yeah, that's a big problem: it seems the expression evaluation machinery goes and detoasts the externally stored data directly, using some random snapshot. Ideally, we should never attempt to detoast the data in the walsender, because there is no guarantee that those data are preserved. I think that before handing the tuple to the expression evaluation machinery, we will have to deform it and do something about the externally stored data; otherwise it will be very difficult to control this inside the expression evaluation. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Slim down integer formatting
On Tue, 27 Jul 2021 at 15:08, Greg Nancarrow wrote: > > On Tue, Jul 27, 2021 at 12:42 PM Michael Paquier wrote: > > > > #include "common/int.h" > > +#include "d2s_intrinsics.h" > > Er, are you sure about this part? The first version of the patch did > > that in a different, also incorrect, way. > > Er, I was just trying to help out, so at least the patch could be > applied (whether the patch has merit is a different story). > Are you saying that it's incorrect to include that header file in this > source, or that's the wrong way to do it? (i.e. it's wrong to adjust > the makefile include path to pickup the location where that header is > located and use #include ""? That header is in src/common, > which is not on the default include path). > The method I used certainly works, but you have objections? > Can you clarify what you say is incorrect? I think the mistake is that the header file is not in src/include/common. For some reason, it's ended up with all the .c files in src/common. I imagine Andrew did this because he didn't ever expect anything else to have a use for these. He indicates that in [1]. Maybe Andrew can confirm? [1] https://www.postgresql.org/message-id/87mup9192t.fsf%40news-spur.riddles.org.uk David
Re: Some code cleanup for pgbench and pg_verifybackup
Bonjour Michaël-san,

>>> The semantics for fatal vs error is that an error is somehow handled
>>> while a fatal is not. If the log message is followed by a cold exit,
>>> ISTM that fatal is the right call, and I cannot help if other commands
>>> do not do that. ISTM more logical to align other commands to fatal
>>> when appropriate.

> I disagree. pgbench is an outlier here. There are 71 calls to
> pg_log_fatal() in src/bin/, and pgbench counts for 54 of them. It would
> be more consistent to align pgbench with the others.

I do not understand your disagreement. Do you disagree about the expected semantics of fatal? Then why provide fatal if it should not be used? What is the expected usage of fatal?

I do not dispute that pgbench is a statistical outlier. However, pgbench is somewhat special because it does not handle a lot of errors, hence the "fatal & exit" pattern is used a lot, and the user has to restart. There are 76 calls to "exit" from pgbench, but only 23 from psql, which is much larger. ISTM that most "normal" pg programs won't do that because they are nice and try to handle errors. So for me "fatal" is the right choice before exiting with a non-zero status, but if "error" is called instead I do not think it matters much; do as you please.

>> I was surprised to discover a few weeks ago that pg_log_fatal() did not
>> terminate the program, which was my expectation.

Mine too.

-- Fabien.