Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Mar 15, 2021 at 11:34 AM Justin Pryzby wrote:
> On Mon, Mar 15, 2021 at 11:25:26AM +0530, Amit Kapila wrote:
> > On Fri, Mar 12, 2021 at 3:01 PM houzj.f...@fujitsu.com wrote:
> > >
> > > Attaching new version patch with this change.
> >
> > Thanks, the patch looks good to me. I have made some minor cosmetic
> > changes in the attached. I am planning to push this by tomorrow unless
> > you or others have any more comments or suggestions for this patch.
>
> I still wonder if it should be possible for the GUC to be off, and then
> selectively enable parallel inserts on specific tables.

I think that could be inconvenient for users because for most tables such
a restriction won't need to be applied.

--
With Regards,
Amit Kapila.
Re: Consider Parallelism While Planning For REFRESH MATERIALIZED VIEW
On Mon, Mar 15, 2021 at 10:39 AM Thomas Munro wrote:
> On Mon, Jan 4, 2021 at 9:05 PM Luc Vlaming wrote:
>
> While reading some back history, I saw that commit e9baa5e9 introduced
> parallelism for CREATE MATERIALIZED VIEW, but REFRESH was ripped out of
> the original patch by Robert, who said:
>
> > The problem with a case like REFRESH MATERIALIZED VIEW is that there's
> > nothing to prevent something that gets run in the course of the query
> > from trying to access the view (and the heavyweight lock won't prevent
> > that, due to group locking). That's probably a stupid thing to do,
> > but it can't be allowed to break the world. The other cases are safe
> > from that particular problem because the table doesn't exist yet.
>
> Hmmm.

I am not sure, but we might want to add this to the comments of the
refresh materialized view function so that it would be easier for future
readers to understand why we have not enabled parallelism for this case.

--
With Regards,
Amit Kapila.
Re: PATCH: Attempt to make dbsize a bit more consistent
On Wed, Feb 24, 2021 at 02:35:51PM +, gkokola...@pm.me wrote:
> Now with attachment. Apologies for the chatter.

The patch has no documentation for the two new functions, so it is a bit
hard to understand, just by reading the patch, what value is brought here
and what the goal is, except that it switches the size reported for views
from zero bytes to NULL, as the regression tests show.

Please note that reporting zero is not wrong for views IMO, as these have
no physical storage.  While it is possible to argue that a size of zero
could imply that the relation could have a non-zero size, for a view it
never will, so I'd rather leave this behavior as it is.  I would bet,
additionally, that this change could break existing use cases.

Reporting NULL, on the contrary, as your patch does, makes things worse
in some ways because that's what pg_relation_size() would report when a
relation does not exist anymore.  Imagine for example the case of a DROP
TABLE run in parallel with a scan of pg_class using pg_relation_size().
It becomes impossible to know whether a relation has been removed or not.
This joins some points raised by Soumyadeep upthread.

Anyway, as mentioned by other people upthread, I am not really convinced
either that we should have more flavors of size functions, particularly
ones that depend on the relkind, as this would be confusing for the
end-user.  pg_relation_size() can already do this job for all relkinds
that use segment files, so it should still be able to hold its ground and
maintain a consistent behavior with what it does currently.

+static int64
+calculate_table_fork_size(Relation rel, ForkNumber forkNum)
+{
+	return (int64)table_relation_size(rel, forkNum);
+}

Why isn't that just unsigned, like table_relation_size()?  This is an
internal function, so it does not matter to change its signature, but I
think that you had better do a cast at a higher level instead.

 	for (int i = 0; i < MAX_FORKNUM; i++)
-		nblocks += smgrnblocks(rel->rd_smgr, i);
+		nblocks += smgrexists(rel->rd_smgr, i)
+			? smgrnblocks(rel->rd_smgr, i)
+			: 0;

Is that actually the part for views?  Why is that done this way?

+	if (rel->rd_rel->relkind == RELKIND_RELATION ||
+		rel->rd_rel->relkind == RELKIND_TOASTVALUE ||
+		rel->rd_rel->relkind == RELKIND_MATVIEW)
+		size = calculate_table_fork_size(rel,
+						forkname_to_number(text_to_cstring(forkName)));
+	else if (rel->rd_rel->relkind == RELKIND_INDEX)
+		size = calculate_relation_size(&(rel->rd_node), rel->rd_backend,
+						forkname_to_number(text_to_cstring(forkName)));
+	else
+	{
+		relation_close(rel, AccessShareLock);
+		PG_RETURN_NULL();
+	}

The approach you are taking to bring some consistency to all that by
making the size estimations go through table AMs via
table_relation_size() is promising.  However, this code breaks the size
estimation for sequences, which is not good.  If attempting to use an
evaluation based on a table AM, shouldn't this code use a check based on
rd_tableam rather than the relkind?  We assume that rd_tableam is set
depending on the relkind in relcache.c, for one.

--
Michael
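The ambiguity described above can be reduced to a toy model (plain Python with invented names; nothing here is the real backend API) — once views also report NULL, a caller scanning the catalog cannot distinguish "this relation is a view" from "this relation was dropped while we scanned":

```python
# Toy model of the NULL-ambiguity objection: relation_size() mimics
# pg_relation_size(), which returns None (NULL) when the relation is gone.
# The catalog dict and all names are illustrative, not the real API.

def relation_size(catalog, relname):
    """Return the size of a relation, None if it does not exist."""
    if relname not in catalog:
        return None                 # dropped concurrently
    kind, size = catalog[relname]
    if kind == "view":
        return None                 # the patched behavior under discussion
    return size

catalog = {"t1": ("table", 8192), "v1": ("view", 0)}

size_of_view = relation_size(catalog, "v1")
size_of_dropped = relation_size(catalog, "t2")   # never existed / dropped

# Both are None: the caller can no longer tell the two cases apart,
# which is the objection raised in the review above.
print(size_of_view, size_of_dropped)
```

With the existing behavior (views report 0), the two cases stay distinguishable.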
Re: SQL-standard function body
On Tue, Mar 9, 2021 at 7:27 AM Peter Eisentraut wrote:
>
> I see.  The problem is that we don't have serialization and
> deserialization support for most utility statements.  I think I'll need
> to add that eventually.  For now, I have added code to prevent utility
> statements.  I think it's still useful that way for now.

Great! thanks!

I found another problem when using CASE expressions:

CREATE OR REPLACE FUNCTION foo_case() RETURNS boolean
LANGUAGE SQL
BEGIN ATOMIC
    select case when random() > 0.5 then true else false end;
END;

Apparently the END in the CASE expression is interpreted as the END of
the function.

--
Jaime Casanova
Director de Servicios Profesionales
SYSTEMGUARDS - Consultores de PostgreSQL
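The parse confusion Jaime reports can be illustrated with a toy scanner (Python; a deliberate simplification, not the real Bison grammar): a scanner that terminates the BEGIN ATOMIC body at the first END keyword truncates any CASE ... END expression inside it, while tracking CASE/END nesting keeps the body intact.

```python
# Toy illustration of the reported bug. This is not how the real parser
# works internally; it only models the "first END wins" failure mode.

def naive_body(tokens):
    """Collect body tokens up to the first END (the buggy behavior)."""
    body = []
    for tok in tokens:
        if tok.upper() == "END":
            return body
        body.append(tok)
    raise ValueError("missing END")

def depth_aware_body(tokens):
    """Track CASE/END nesting so only the outermost END closes the body."""
    body, depth = [], 0
    for tok in tokens:
        u = tok.upper()
        if u == "CASE":
            depth += 1
        elif u == "END":
            if depth == 0:
                return body          # this END really closes the function
            depth -= 1
        body.append(tok)
    raise ValueError("missing END")

toks = "select case when random() > 0.5 then true else false end ; end".split()
print(naive_body(toks))        # stops at the CASE's end: the body is cut short
print(depth_aware_body(toks))  # keeps the whole SELECT, stops at the final end
```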
Re: Parallel INSERT (INTO ... SELECT ...)
On Mon, Mar 15, 2021 at 11:25:26AM +0530, Amit Kapila wrote:
> On Fri, Mar 12, 2021 at 3:01 PM houzj.f...@fujitsu.com wrote:
> >
> > Attaching new version patch with this change.
>
> Thanks, the patch looks good to me. I have made some minor cosmetic
> changes in the attached. I am planning to push this by tomorrow unless
> you or others have any more comments or suggestions for this patch.

I still wonder if it should be possible for the GUC to be off, and then
selectively enable parallel inserts on specific tables.  In that case,
the GUC should be called something other than enable_*, or maybe it
should be an enum with values like "off" and "allowed"/enabled/defer/???

--
Justin
Re: PITR promote bug: Checkpointer writes to older timeline
At Sun, 14 Mar 2021 17:59:59 +0900, Michael Paquier wrote in
> On Thu, Mar 04, 2021 at 05:10:36PM +0900, Kyotaro Horiguchi wrote:
> > read_local_xlog_page is *designed* to maintain ThisTimeLineID.
> > Currently it doesn't seem utilized but I think it's sufficiently
> > reasonable that the function maintains ThisTimeLineID.
>
> I don't quite follow this line of thoughts.  ThisTimeLineID is
> designed to remain 0 while recovery is running in most processes
> (at the close exception of a WAL sender with a cascading setup,

The reason for the "0" is that they just aren't interested in the value.
The checkpointer temporarily uses it, then restores it to 0 soon.

> physical or logical, of course), so why is there any business for
> read_local_xlog_page() to touch this field at all while in recovery to
> begin with?

The logical decoding stuff is (I think) designed to turn any backend into
a walsender, which may need to maintain ThisTimeLineID.  It seems to me
that the logical decoding stuff intends to maintain ThisTimeLineID of
such backends when reading a WAL record.  logical_read_xlog_page also
updates ThisTimeLineID, and pg_logical_slot_get_changes_guts(),
pg_replication_slot_advance() (and maybe other functions) update
ThisTimeLineID.  So it is natural that read_local_xlog_page() updates it,
since it is intended to be used in logical decoding plugins.

> I equally find confusing that XLogReadDetermineTimeline() relies on a
> specific value of ThisTimeLineID in its own logic, while it clearly
> states that all its callers have to read the current active TLI
> beforehand.  So I think that the existing logic is pretty weak, and
> that resetting the field is an incorrect approach?  It seems to me

It is initialized by IdentifySystem().  And the logical walsender intends
to maintain ThisTimeLineID by subsequent calls to GetStandbyFlushRecPtr(),
which happen in logical_read_xlog_page().

> that we had better not change ThisTimeLineID actively while in
> recovery in this code path and just let others do the job, like
> RecoveryInProgress() once recovery finishes, or
> GetStandbyFlushRecPtr() for a WAL sender.  And finally, we should
> store the current TLI used for replay in a separate variable that gets
> passed down to the XLogReadDetermineTimeline() as argument.

I agree that it's better for the replay TLI to be stored in a separate
variable.  That is what I was complaining about in the previous mails.
(It might not have been so obvious, though..)

> While going through it, I have simplified a bit the proposed TAP tests
> (thanks for replacing the sleep() call, Soumyadeep.  This would have
> made the test slower for nothing on fast machines, and it would cause
> failures on very slow machines).
>
> The attached fixes the original issue for me, keeping all the records
> in their correct timeline.  And I have not been able to break
> cascading setups.  If it happens that such cases actually break, we
> have holes in our existing test coverage that should be improved.  I
> cannot see anything fancy missing on this side, though.
>
> Any thoughts?

I don't think there's any actual user of the function for the purpose,
but..  Anyway, if we remove the update of ThisTimeLineID from
read_local_xlog_page, I think we should remove or rewrite the following
comment for the function.  It no longer works as written in the
catchphrase.

> * Public because it would likely be very helpful for someone writing another
> * output method outside walsender, e.g. in a bgworker.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: Parallel INSERT (INTO ... SELECT ...)
On Fri, Mar 12, 2021 at 3:01 PM houzj.f...@fujitsu.com wrote:
>
> Attaching new version patch with this change.

Thanks, the patch looks good to me. I have made some minor cosmetic
changes in the attached. I am planning to push this by tomorrow unless
you or others have any more comments or suggestions for this patch.

--
With Regards,
Amit Kapila.

v28-0001-Add-a-new-GUC-and-a-reloption-to-enable-inserts-.patch
Re: Should we improve "PID XXXX is not a PostgreSQL server process" warning for pg_terminate_backend(<>)?
On 2021-03-07 19:16, Bharath Rupireddy wrote:
> On Fri, Feb 5, 2021 at 5:15 PM Bharath Rupireddy wrote:
> > pg_terminate_backend and pg_cancel_backend with the postmaster PID
> > produce a "PID is not a PostgreSQL server process" warning [1], which
> > basically implies that the postmaster is not a PostgreSQL process at
> > all. This is a bit misleading because the postmaster is the parent of
> > all PostgreSQL processes. Should we improve the warning message if the
> > given PID is the postmaster's PID?

+1. I felt it was a bit confusing when reviewing a thread [1].

> > If yes, how about a generic message for both of the functions -
> > "signalling postmaster process is not allowed" or "cannot signal
> > postmaster process" or some other better suggestion?
> >
> > [1] 2471176 ---> is postmaster PID.
> > postgres=# select pg_terminate_backend(2471176);
> > WARNING:  PID 2471176 is not a PostgreSQL server process
> >  pg_terminate_backend
> > ----------------------
> >  f
> > (1 row)
> >
> > postgres=# select pg_cancel_backend(2471176);
> > WARNING:  PID 2471176 is not a PostgreSQL server process
> >  pg_cancel_backend
> > -------------------
> >  f
> > (1 row)
>
> I'm attaching a small patch that emits a warning "signalling postmaster
> with PID %d is not allowed" for the postmaster, and "signalling
> PostgreSQL server process with PID %d is not allowed" for auxiliary
> processes such as the checkpointer, background writer, and walwriter.
> However, for the stats collector and syslogger processes, we still get
> the "PID X is not a PostgreSQL server process" warning because they
> don't have PGPROC entries(??). So BackendPidGetProc and
> AuxiliaryPidGetProc will not help, and even pg_stat_activity does not
> have these processes' PIDs.

I also ran into the same problem while creating a patch in [2].

I'm now wondering about changing the message to something like "PID is
not a PostgreSQL backend process".  "Backend process" is now defined as
"Process of an instance which acts on behalf of a client session and
handles its requests." in the Appendix.

[1] https://www.postgresql.org/message-id/CALDaNm3ZzmFS-%3Dr7oDUzj7y7BgQv%2BN06Kqyft6C3xZDoKnk_6w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/0271f440ac77f2a4180e0e56ebd944d1%40oss.nttdata.com

Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
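The message selection being debated above amounts to classifying the target PID. A rough sketch of that decision follows (Python, with sets standing in for BackendPidGetProc()/AuxiliaryPidGetProc() lookups — the real implementation is C inside the backend, and these names are illustrative):

```python
# Toy model of the proposed warning-message choice. The pid sets are
# stand-ins for PGPROC lookups; processes without PGPROC entries (stats
# collector, syslogger) fall through to the generic message, as noted
# in the mail above.

def signal_warning(pid, postmaster_pid, backend_pids, aux_pids):
    """Return the warning text for an invalid target, or None if OK."""
    if pid == postmaster_pid:
        return f"signalling postmaster with PID {pid} is not allowed"
    if pid in aux_pids:
        return f"signalling PostgreSQL server process with PID {pid} is not allowed"
    if pid in backend_pids:
        return None  # a regular backend: signalling proceeds, no warning
    return f"PID {pid} is not a PostgreSQL server process"

print(signal_warning(100, 100, {200}, {300}))  # postmaster case
print(signal_warning(300, 100, {200}, {300}))  # auxiliary process case
print(signal_warning(200, 100, {200}, {300}))  # regular backend: None
print(signal_warning(999, 100, {200}, {300}))  # unknown pid: generic message
```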
Re: Regression tests vs SERIALIZABLE
On Mon, Mar 15, 2021 at 6:14 PM Bharath Rupireddy wrote:
> On Mon, Mar 15, 2021 at 9:54 AM Thomas Munro wrote:
> > While reviewing the patch for parallel REFRESH MATERIALIZED VIEW, I
> > noticed that select_parallel.sql and write_parallel.sql believe that
> > (1) the tests are supposed to work with serializable as a default
> > isolation level, and (2) parallelism would be inhibited by that, so
> > they'd better use something else explicitly.  Here's a patch to update
> > that second thing in light of commit bb16aba5.  I don't think it
> > matters enough to bother back-patching it.
>
> +1, patch basically LGTM. I have one point - do we also need to remove
> "begin isolation level repeatable read;" in aggregates.sql, explain.sql
> and insert_parallel.sql? And in insert_parallel.sql, the comment also
> says "Serializable isolation would disable parallel query", which is
> not true after bb16aba5. Do we need to change that too?

Yeah, you're right.  That brings us to the attached.

From 19be76b8903ad6f179463bc058474338e9d53f8c Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 15 Mar 2021 17:08:25 +1300
Subject: [PATCH v2] Parallel query regression tests don't need SERIALIZABLE
 workaround.

SERIALIZABLE no longer inhibits parallelism, so the comment and
workaround in various regression tests are obsolete since commit
bb16aba5 (release 12).  Also fix a typo.
Reviewed-by: Bharath Rupireddy
Discussion: https://postgr.es/m/CA%2BhUKGJUaHeK%3DHLATxF1JOKDjKJVrBKA-zmbPAebOM0Se2FQRg%40mail.gmail.com
---
 src/test/regress/expected/aggregates.out      | 4 ++--
 src/test/regress/expected/explain.out         | 4 +---
 src/test/regress/expected/insert_parallel.out | 4 +---
 src/test/regress/expected/select_parallel.out | 4 +---
 src/test/regress/expected/write_parallel.out  | 6 ++----
 src/test/regress/sql/aggregates.sql           | 4 ++--
 src/test/regress/sql/explain.sql              | 4 +---
 src/test/regress/sql/insert_parallel.sql      | 4 +---
 src/test/regress/sql/select_parallel.sql      | 4 +---
 src/test/regress/sql/write_parallel.sql       | 6 ++----
 10 files changed, 14 insertions(+), 30 deletions(-)

diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 2c818d9253..1ae0e5d939 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -2411,7 +2411,7 @@ ROLLBACK;
 -- Secondly test the case of a parallel aggregate combiner function
 -- returning NULL. For that use normal transition function, but a
 -- combiner function returning NULL.
-BEGIN ISOLATION LEVEL REPEATABLE READ;
+BEGIN;
 CREATE FUNCTION balkifnull(int8, int8)
 RETURNS int8
 PARALLEL SAFE
@@ -2453,7 +2453,7 @@ SELECT balk(hundred) FROM tenk1;
 ROLLBACK;
 -- test coverage for aggregate combine/serial/deserial functions
-BEGIN ISOLATION LEVEL REPEATABLE READ;
+BEGIN;
 SET parallel_setup_cost = 0;
 SET parallel_tuple_cost = 0;
 SET min_parallel_table_scan_size = 0;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index dc7ab2ce8b..791eba8511 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -293,9 +293,7 @@ rollback;
 -- actually get (maybe none at all), we can't examine the "Workers" output
 -- in any detail.  We can check that it parses correctly as JSON, and then
 -- remove it from the displayed results.
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
diff --git a/src/test/regress/expected/insert_parallel.out b/src/test/regress/expected/insert_parallel.out
index d5fae79031..3599c21eba 100644
--- a/src/test/regress/expected/insert_parallel.out
+++ b/src/test/regress/expected/insert_parallel.out
@@ -52,9 +52,7 @@ insert into test_data select * from generate_series(1,10);
 --
 -- END: setup some tables and data needed by the tests.
 --
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 9b0c418db7..05ebcb284a 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -3,9 +3,7 @@
 --
 create function sp_parallel_restricted(int) returns int as
 $$begin return $1; end$$ language plpgsql parallel restricted;
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
diff --git a/src/test/regress/expected/write_parallel.out b/src/test/
Re: Regression tests vs SERIALIZABLE
On Mon, Mar 15, 2021 at 9:54 AM Thomas Munro wrote:
> While reviewing the patch for parallel REFRESH MATERIALIZED VIEW, I
> noticed that select_parallel.sql and write_parallel.sql believe that
> (1) the tests are supposed to work with serializable as a default
> isolation level, and (2) parallelism would be inhibited by that, so
> they'd better use something else explicitly.  Here's a patch to update
> that second thing in light of commit bb16aba5.  I don't think it
> matters enough to bother back-patching it.

+1, patch basically LGTM. I have one point - do we also need to remove
"begin isolation level repeatable read;" in aggregates.sql, explain.sql
and insert_parallel.sql? And in insert_parallel.sql, the comment also
says "Serializable isolation would disable parallel query", which is
not true after bb16aba5. Do we need to change that too?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Re: Consider Parallelism While Planning For REFRESH MATERIALIZED VIEW
On Mon, Jan 4, 2021 at 9:05 PM Luc Vlaming wrote:
> The new status of this patch is: Ready for Committer

I think the comments above this might as well be removed, because they
aren't very convincing:

+-- Allow parallel planning of the underlying query for refresh materialized
+-- view. We can be ensured that parallelism will be picked because of the
+-- enforcement done at the beginning of the test.
+refresh materialized view parallel_mat_view;

If you just leave the REFRESH command, at least it'll be exercised, and I
know you have a separate CF entry to add EXPLAIN support for REFRESH.  So
I'd just rip these weasel words out, and then in a later commit you can
add the EXPLAIN there where it's obviously missing.

While reading some back history, I saw that commit e9baa5e9 introduced
parallelism for CREATE MATERIALIZED VIEW, but REFRESH was ripped out of
the original patch by Robert, who said:

> The problem with a case like REFRESH MATERIALIZED VIEW is that there's
> nothing to prevent something that gets run in the course of the query
> from trying to access the view (and the heavyweight lock won't prevent
> that, due to group locking). That's probably a stupid thing to do,
> but it can't be allowed to break the world. The other cases are safe
> from that particular problem because the table doesn't exist yet.

Hmmm.
Re: A new function to wait for the backend exit after termination
On 2021/03/15 12:27, Bharath Rupireddy wrote:
> On Sun, Mar 7, 2021 at 2:39 PM Bharath Rupireddy wrote:
> > Attaching v7 patch for further review.
>
> Attaching v8 patch after rebasing on to the latest master.

Thanks for rebasing the patch!

-	WAIT_EVENT_XACT_GROUP_UPDATE
+	WAIT_EVENT_XACT_GROUP_UPDATE,
+	WAIT_EVENT_BACKEND_TERMINATION

These should be listed in alphabetical order.

In pg_wait_until_termination's do-while loop, ResetLatch() should be
called.  Otherwise, it would enter a busy loop after any signal arrives,
because the latch is kept set and WaitLatch() always exits immediately
in that case.

+	/*
+	 * Wait in steps of waittime milliseconds until this function exits or
+	 * timeout.
+	 */
+	int64		waittime = 10;

10 ms per cycle seems too frequent?

+		ereport(WARNING,
+				(errmsg("timeout cannot be negative or zero: %lld",
+						(long long int) timeout)));
+
+		result = false;

IMO the parameter should be verified before doing the actual thing.
Why is a WARNING thrown in this case?  Isn't it better to throw an
ERROR, like pg_promote() does?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
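The ResetLatch() point above can be illustrated outside the server with a toy event loop — here `threading.Event` stands in for a process latch (a sketch, not backend code): if the "latch" is never cleared after waking, every subsequent wait returns immediately and the loop spins.

```python
# Toy model of the ResetLatch() review comment: Event.wait() plays the
# role of WaitLatch(), Event.clear() the role of ResetLatch().
import threading
import time

def wait_loop(latch, deadline, reset_after_wake):
    """Wait repeatedly on the latch until the deadline; count iterations."""
    iterations = 0
    while time.monotonic() < deadline:
        latch.wait(timeout=0.01)     # like WaitLatch() with a timeout
        if reset_after_wake:
            latch.clear()            # like ResetLatch()
        iterations += 1
    return iterations

latch = threading.Event()
latch.set()                          # a signal already arrived

# Without clearing, wait() returns immediately every time: a busy loop.
busy = wait_loop(latch, time.monotonic() + 0.1, reset_after_wake=False)

latch.set()
# With clearing, each iteration actually sleeps for the timeout.
calm = wait_loop(latch, time.monotonic() + 0.1, reset_after_wake=True)

print(busy, calm)   # busy is far larger than calm
```

The same asymmetry is why a do-while loop around WaitLatch() without ResetLatch() burns CPU after the first wakeup.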
Re: Procedures versus the "fastpath" API
On Tue, Mar 09, 2021 at 02:33:47PM -0500, Joe Conway wrote:
> My vote would be to reject using fastpath for procedures in all relevant
> branches.  If someday someone cares enough to make it work, it is a new
> feature for a new major release.

FWIW, my vote would go for issuing an error if attempting to use a
procedure in the fast path for all the branches.  The lack of complaint
about the error you are mentioning sounds like a pretty good argument to
fail properly on existing branches, and work on this as a new feature in
the future if there is anybody willing to make a case for it.

--
Michael
Re: Extensions not dumped when --schema is used
On Sun, Feb 21, 2021 at 08:14:45AM +0900, Michael Paquier wrote:
> Okay, that sounds fine to me. Thanks for confirming.

Guillaume, it has been a couple of weeks since your last update.  Are you
planning to send a new version of the patch?

--
Michael
Re: Improvements and additions to COPY progress reporting
On Wed, Mar 10, 2021 at 09:35:10AM +0100, Matthias van de Meent wrote:
> There are examples in which pg_stat_progress_* views report
> inaccurate data. I think it is fairly reasonable to at least validate
> some part of the progress reporting, as it is one of the few methods
> for administrators to look at the state of currently running
> administrative tasks, and as such, this user-visible API should be
> validated.

Looking closer at 0002, the size numbers are actually incorrect on
Windows for the second query.  The CRLFs at the end of each line of
emp.data add three bytes to the report of COPY FROM, so this finishes
with 82 bytes for bytes_total and bytes_processed instead of 79.  Let's
make this useful but simpler here, so I propose to check that the
counters are higher than zero instead of an exact number.  Let's also
add the relation name relid::regclass while on it.

The tests introduced are rather limited, but you are right that
something is better than nothing here, and I have slightly updated what
the tests sent previously as per the attached.  What do you think?

--
Michael

diff --git a/src/test/regress/input/copy.source b/src/test/regress/input/copy.source
index a1d529ad36..9e4766898d 100644
--- a/src/test/regress/input/copy.source
+++ b/src/test/regress/input/copy.source
@@ -201,3 +201,65 @@ select * from parted_copytest where b = 1;
 select * from parted_copytest where b = 2;
 
 drop table parted_copytest;
+
+--
+-- Progress reporting for COPY
+--
+create table tab_progress_reporting (
+	name		text,
+	age		int4,
+	location	point,
+	salary		int4,
+	manager		name
+);
+
+-- Add a trigger to catch and print the contents of the catalog view
+-- pg_stat_progress_copy during data insertion.  This allows to test
+-- the validation of some progress reports for COPY FROM where the trigger
+-- would fire.
+create function notice_after_tab_progress_reporting() returns trigger AS
+$$
+declare report record;
+begin
+  -- The fields ignored here are the ones that may not remain
+  -- consistent across multiple runs.  The sizes reported may differ
+  -- across platforms, so just check if these are strictly positive.
+  with progress_data as (
+    select
+       relid::regclass::text as relname,
+       command,
+       type,
+       bytes_processed > 0 as has_bytes_processed,
+       bytes_total > 0 as has_bytes_total,
+       tuples_processed,
+       tuples_excluded
+      from pg_stat_progress_copy
+      where pid = pg_backend_pid())
+  select into report (to_jsonb(r)) as value
+    from progress_data r;
+
+  raise info 'progress: %', report.value::text;
+  return new;
+end;
+$$ language plpgsql;
+
+create trigger check_after_tab_progress_reporting
+	after insert on tab_progress_reporting
+	for each statement
+	execute function notice_after_tab_progress_reporting();
+
+-- Generate COPY FROM report with PIPE.
+copy tab_progress_reporting from stdin;
+sharon	25	(15,12)	1000	sam
+sam	30	(10,5)	2000	bill
+bill	20	(11,10)	1000	sharon
+\.
+
+-- Generate COPY FROM report with FILE, with some excluded tuples.
+truncate tab_progress_reporting;
+copy tab_progress_reporting from '@abs_srcdir@/data/emp.data'
+	where (salary < 2000);
+
+drop trigger check_after_tab_progress_reporting on tab_progress_reporting;
+drop function notice_after_tab_progress_reporting();
+drop table tab_progress_reporting;
diff --git a/src/test/regress/output/copy.source b/src/test/regress/output/copy.source
index 938d3551da..8cf37dc257 100644
--- a/src/test/regress/output/copy.source
+++ b/src/test/regress/output/copy.source
@@ -165,3 +165,57 @@ select * from parted_copytest where b = 2;
 (1 row)
 
 drop table parted_copytest;
+--
+-- Progress reporting for COPY
+--
+create table tab_progress_reporting (
+	name		text,
+	age		int4,
+	location	point,
+	salary		int4,
+	manager		name
+);
+-- Add a trigger to catch and print the contents of the catalog view
+-- pg_stat_progress_copy during data insertion.  This allows to test
+-- the validation of some progress reports for COPY FROM where the trigger
+-- would fire.
+create function notice_after_tab_progress_reporting() returns trigger AS
+$$
+declare report record;
+begin
+  -- The fields ignored here are the ones that may not remain
+  -- consistent across multiple runs.  The sizes reported may differ
+  -- across platforms, so just check if these are strictly positive.
+  with progress_data as (
+    select
+       relid::regclass::text as relname,
+       command,
+       type,
+       bytes_processed > 0 as has_bytes_processed,
+       bytes_total > 0 as has_bytes_total,
+       tuples_processed,
+       tuples_excluded
+      from pg_stat_progress_copy
+      where pid = pg_backend_pid())
+  select into report (to_jsonb(r)) as value
+    from progress_data r;
+
+  raise info 'progress: %', report.value::text;
+  return new;
+end;
+$$ language plpgsql;
+create trigger check_after_tab_progress_reporting
+	after insert on tab_progress_reporting
+	for each statement
+	execute function notice_after_tab_progress_reporting();
+-- Generate CO
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Mon, Mar 15, 2021 at 4:54 AM, Julien Rouhaud wrote:
> On Sun, Mar 14, 2021 at 08:41:15PM +0100, Pavel Stehule wrote:
> >
> > My English is good enough for getting a beer anywhere in the world :).
> > It is not good, but a lot of people who don't understand English
> > understand my simplified fork of the English language.
>
> I have the same problem.  We can talk about it and other stuff while
> having a beer sometimes :)

:)

> > updated patch attached
>
> Thanks, it looks good to me, I'm marking the patch as RFC!

Thank you very much

Regards

Pavel
Re: Change JOIN tutorial to focus more on explicit joins
On Mon, Mar 15, 2021 at 3:48 AM, Thomas Munro wrote:
> On Thu, Mar 11, 2021 at 2:06 AM David Steele wrote:
> > On 12/1/20 3:38 AM, Jürgen Purtz wrote:
> > > OK. Patch attached.
>
> +Queries which access multiple tables (including repeats) at once are called
>
> I'd write "Queries that" here (that's a transatlantic difference in
> usage; I try to proofread these things in American mode for
> consistency with the rest of the language in this project, which I
> probably don't entirely succeed at, but this one I've learned...).
>
> Maybe instead of "(including repeats)" it could say "(or multiple
> instances of the same table)"?
>
> +For example, to return all the weather records together with the location of the
> +associated city, the database compares the city
> column of each row of the weather table with the
> name column of all rows in the cities
> table, and select the pairs of rows where these values match.
>
> Here "select" should agree with "the database" and take an -s, no?
>
> +This syntax pre-dates the JOIN and ON
> +keywords.  The tables are simply listed in the FROM,
> +comma-separated, and the comparison expression added to the
> +WHERE clause.
>
> Could we mention SQL92 somewhere?  Like maybe "This syntax pre-dates
> the JOIN and ON keywords, which were introduced by SQL-92".  (That's a
> "non-restrictive which", I think the clue is the comma?)

The previous syntax should be mentioned too.  A reader can find this
syntax in thousands of applications.

Pavel
Regression tests vs SERIALIZABLE
Hi,

While reviewing the patch for parallel REFRESH MATERIALIZED VIEW, I
noticed that select_parallel.sql and write_parallel.sql believe that
(1) the tests are supposed to work with serializable as a default
isolation level, and (2) parallelism would be inhibited by that, so
they'd better use something else explicitly.  Here's a patch to update
that second thing in light of commit bb16aba5.  I don't think it
matters enough to bother back-patching it.

However, since commit 862ef372d6b, there *is* one test that fails if
you run make installcheck against a cluster running with -c
default_transaction_isolation=serializable: transaction.sql.  Is that
a mistake?  Is it a goal to be able to run this test suite against all
3 isolation levels?

@@ -1032,7 +1032,7 @@ SHOW transaction_isolation;
 -- out of transaction block
  transaction_isolation
 -----------------------
- read committed
+ serializable
 (1 row)

From 21115a0721aedec9c39a1d41bc38ec2b96960ff1 Mon Sep 17 00:00:00 2001
From: Thomas Munro
Date: Mon, 15 Mar 2021 17:08:25 +1300
Subject: [PATCH] Parallel regression tests don't need workaround for
 SERIALIZABLE.

SERIALIZABLE no longer inhibits parallelism, so the comment and
workaround in {select,write}_parallel.sql are obsolete since commit
bb16aba5 (release 12).  Also fix a nearby typo.
---
 src/test/regress/expected/select_parallel.out | 4 +---
 src/test/regress/expected/write_parallel.out  | 6 ++----
 src/test/regress/sql/select_parallel.sql      | 4 +---
 src/test/regress/sql/write_parallel.sql       | 6 ++----
 4 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 9b0c418db7..05ebcb284a 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -3,9 +3,7 @@
 --
 create function sp_parallel_restricted(int) returns int as
 $$begin return $1; end$$ language plpgsql parallel restricted;
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
diff --git a/src/test/regress/expected/write_parallel.out b/src/test/regress/expected/write_parallel.out
index 0c4da2591a..77705f9a70 100644
--- a/src/test/regress/expected/write_parallel.out
+++ b/src/test/regress/expected/write_parallel.out
@@ -1,16 +1,14 @@
 --
 -- PARALLEL
 --
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
 set min_parallel_table_scan_size=0;
 set max_parallel_workers_per_gather=4;
 --
--- Test write operations that has an underlying query that is eligble
+-- Test write operations that has an underlying query that is eligible
 -- for parallel plans
 --
 explain (costs off) create table parallel_write as
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 5a01a98b26..d31e290ec2 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -5,9 +5,7 @@
 create function sp_parallel_restricted(int) returns int as
 $$begin return $1; end$$ language plpgsql parallel restricted;
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
diff --git a/src/test/regress/sql/write_parallel.sql b/src/test/regress/sql/write_parallel.sql
index 78b479cedf..a5d63112c9 100644
--- a/src/test/regress/sql/write_parallel.sql
+++ b/src/test/regress/sql/write_parallel.sql
@@ -2,9 +2,7 @@
 -- PARALLEL
 --
--- Serializable isolation would disable parallel query, so explicitly use an
--- arbitrary other level.
-begin isolation level repeatable read;
+begin;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
@@ -13,7 +11,7 @@ set min_parallel_table_scan_size=0;
 set max_parallel_workers_per_gather=4;
 --
--- Test write operations that has an underlying query that is eligble
+-- Test write operations that has an underlying query that is eligible
 -- for parallel plans
 --
 explain (costs off) create table parallel_write as
-- 
2.30.1
Re: pl/pgsql feature request: shorthand for argument and local variable references
On Sun, Mar 14, 2021 at 08:41:15PM +0100, Pavel Stehule wrote: > > My English is good enough for taking beer everywhere in the world :) . Ti > is not good, but a lot of people who don't understand to English understand > my simplified fork of English language. I have the same problem. We can talk about it and other stuff while having a beer sometimes :) > updated patch attached Thanks, it looks good to me, I'm marking the patch as RFC!
Re: REINDEX backend filtering
Please find attached v7, with the following changes:

- all typos reported by Michael and Mark are fixed
- REINDEX (OUTDATED) INDEX will now ignore the index if it doesn't have any outdated dependency. Partitioned indexes are correctly handled.
- REINDEX (OUTDATED, VERBOSE) will now inform the caller of ignored indexes, with lines of the form: NOTICE: index "index_name" has no outdated dependency
- updated regression tests to cover all those changes. I kept the current approach of using a simple SQL test listing the ignored indexes. I also added the OUTDATED option to the collate.icu.utf8 tests so that we also check that both REINDEX and REINDEX (OUTDATED) work as expected.
- moved pg_index_has_outdated_dependency to 0002

I didn't remove index_has_outdated_collation() for now.

>From 91bcd6e4565164314eb6444635ca274695de3748 Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Thu, 3 Dec 2020 15:54:42 +0800 Subject: [PATCH v7 1/2] Add a new OUTDATED filtering facility for REINDEX command. OUTDATED is added a new unreserved keyword. When used, REINDEX will only process indexes that have an outdated dependency. For now, only dependency on collations are supported but we'll likely support other kind of dependency in the future.
Author: Julien Rouhaud Reviewed-by: Michael Paquier, Mark Dilger Discussion: https://postgr.es/m/20201203093143.GA64934%40nol --- doc/src/sgml/ref/reindex.sgml | 12 +++ src/backend/access/index/indexam.c| 59 ++ src/backend/catalog/index.c | 79 ++- src/backend/commands/indexcmds.c | 37 - src/backend/parser/gram.y | 4 +- src/backend/utils/cache/relcache.c| 47 +++ src/bin/psql/tab-complete.c | 2 +- src/include/access/genam.h| 1 + src/include/catalog/index.h | 3 + src/include/parser/kwlist.h | 1 + src/include/utils/relcache.h | 1 + .../regress/expected/collate.icu.utf8.out | 12 ++- src/test/regress/expected/create_index.out| 27 +++ src/test/regress/sql/collate.icu.utf8.sql | 12 ++- src/test/regress/sql/create_index.sql | 18 + 15 files changed, 301 insertions(+), 14 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index ff4dba8c36..c4749c338b 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -26,6 +26,7 @@ REINDEX [ ( option [, ...] ) ] { IN where option can be one of: CONCURRENTLY [ boolean ] +OUTDATED [ boolean ] TABLESPACE new_tablespace VERBOSE [ boolean ] @@ -188,6 +189,17 @@ REINDEX [ ( option [, ...] ) ] { IN + +OUTDATED + + + This option can be used to filter the list of indexes to rebuild and only + process indexes that have outdated dependencies. Fow now, the only + dependency handled currently is the collation provider version. + + + + TABLESPACE diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index 3d2dbed708..dc1c85cf0d 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -145,6 +145,65 @@ index_open(Oid relationId, LOCKMODE lockmode) return r; } +/* + * try_index_open - open an index relation by relation OID + * + * Same as index_open, except return NULL instead of failing + * if the index does not exist. 
+ * + */ +Relation +try_index_open(Oid relationId, LOCKMODE lockmode) +{ + Relation r; + + Assert(lockmode >= NoLock && lockmode < MAX_LOCKMODES); + + /* Get the lock first */ + if (lockmode != NoLock) + LockRelationOid(relationId, lockmode); + + /* + * Now that we have the lock, probe to see if the relation really exists + * or not. + */ + if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relationId))) + { + /* Release useless lock */ + if (lockmode != NoLock) + UnlockRelationOid(relationId, lockmode); + + return NULL; + } + + /* Should be safe to do a relcache load */ + r = RelationIdGetRelation(relationId); + + if (!RelationIsValid(r)) + elog(ERROR, "could not open relation with OID %u", relationId); + + /* If we didn't get the lock ourselves, assert that caller holds one */ + Assert(lockmode != NoLock || + CheckRelationLockedByMe(r, AccessShareLock, true)); + + if (r->rd_rel->relkind != RELKIND_INDEX && + r->rd_rel->relkind != RELKIND_PARTITIONED_INDEX) + { + ereport(ERROR, +(errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("\"%s\" is not an index", + RelationGetRelationName(r; + } + + /* Make note that we've accessed a temporary relation */ + if (RelationUsesLocalBuffers(r)) + MyXactFlags |= XACT_FLAGS_ACCESSEDTEMPNAMESPACE; + + pgstat_initstats(r); + + return r; +} + /* * index_close - close an index relation * diff --git a/src/backend/catalog/index.c b/src/b
Re: EXPLAIN/EXPLAIN ANALYZE REFRESH MATERIALIZED VIEW
On Sat, Mar 13, 2021 at 7:00 AM Japin Li wrote: > > On Mon, 08 Mar 2021 at 12:28, Bharath Rupireddy > wrote: > > On Sun, Mar 7, 2021 at 10:13 PM Zhihong Yu wrote: > >> Hi, > >> > >> +* EXPLAIN ANALYZE CREATE TABLE AS or REFRESH MATERIALIZED VIEW > >> +* WITH NO DATA is weird. > >> > >> Maybe it is clearer to spell out WITH NO DATA for both statements, instead > >> of sharing it. > > > > Done that way. > > > >> - if (!stmt->skipData) > >> + if (!stmt->skipData && !explainInfo) > >> ... > >> + else if (explainInfo) > >> > >> It would be cleaner to put the 'if (explainInfo)' as the first check. That > >> way, the check for skipData can be simplified. > > > > Changed. > > > > Thanks for review comments. Attaching v7 patch set with changes only > > in 0002 patch. Please have a look. > > > > The v7 patch looks good to me, and there is no other advice, so I change > the status to "Ready for Committer". Thanks for the review. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com
Re: A new function to wait for the backend exit after termination
On Sun, Mar 7, 2021 at 2:39 PM Bharath Rupireddy wrote: > Attaching v7 patch for further review. Attaching v8 patch after rebasing on to the latest master. With Regards, Bharath Rupireddy. EnterpriseDB: http://www.enterprisedb.com v8-0001-pg_terminate_backend-with-wait-and-timeout.patch Description: Binary data
Re: recovering from "found xmin ... from before relfrozenxid ..."
On Mon, Sep 21, 2020 at 1:11 PM Andres Freund wrote: > > I'm not sure I really understand how that's happening, because surely > > HOT chains and non-HOT chains are pruned by the same code, but it > > doesn't sound good. > > Not necessarily, unfortunately: > > case HEAPTUPLE_DEAD: > So if e.g. a transaction aborts between the heap_page_prune and this > check the pruning behaviour depends on whether the tuple is part of a > HOT chain or not. I have a proposal that includes removing this "tupgone = true" special case: https://postgr.es/m/CAH2-Wzm7Y=_g3fjvhv7a85afubusydggdneqn1hodveoctl...@mail.gmail.com Of course this won't change the fact that vacuumlazy.c can disagree with pruning about what is dead -- that is a necessary consequence of having multiple HTSV calls for the same tuple in vacuumlazy.c (it can change in the presence of concurrently aborted transactions). But removing the "tupgone = true" special case does seem much more consistent, and simpler overall. We have lots of code that is only needed to make that special case work. For example, the whole idea of a dedicated XLOG_HEAP2_CLEANUP_INFO record for recovery conflicts can go -- we can get by with only XLOG_HEAP2_CLEAN records from pruning, IIRC. Have I missed something? The special case in question seems pretty awful to me, so I have to wonder why somebody else didn't remove it long ago... -- Peter Geoghegan
Re: [HACKERS] logical decoding of two-phase transactions
On Fri, Mar 12, 2021 at 2:09 PM Peter Smith wrote: > > Please find attached the latest patch set v58* > In this patch-series, I see a problem with synchronous replication when GUC 'synchronous_standby_names' is configured to use subscriber. This will allow Prepares and Commits to wait for the subscriber to finish. Before this patch, we never send prepare as two-phase was not enabled by a subscriber, so it won't wait for it, rather it will make progress because we send keep_alive messages. But after this patch, it will start waiting for Prepare to finish. Now, without spool-file logic, it will work because prepares are decoded on subscriber and a corresponding ack will be sent to a publisher but for the spool-file case, we will wait for Publisher to send commit prepared and in publisher prepare is not finished because we are waiting for its ack. So, it will create a sort of deadlock. This is related to the problem as mentioned in the below comments in the patch: + * A future release may be able to detect when all tables are READY and set + * a flag to indicate this subscription/slot is ready for two_phase + * decoding. Then at the publisher-side, we could enable wait-for-prepares + * only when all the slots of WALSender have that flag set. The difference is that it can happen now itself, prepares automatically wait if 'synchronous_standby_names' is set. Now, we can imagine a solution where after spooling to file the changes which can't be applied during syncup phase, we update the flush location so that publisher can proceed with that prepare. But I think that won't work because once we have updated the flush location those prepares won't be sent again and it is quite possible that we don't have complete relation information as the schema is not sent with each transaction. 
Now, we could go one step further and try to remember the schema information the first time it is sent so that it can be reused after restart, but I think that would complicate the patch and the overall design. I think there is a simpler solution to these problems. The idea is to enable two_phase only after the initial sync is over (all relations are in a READY state). If we switch on 2PC only after all the relations reach the READY state, then we shouldn't get any prepare before the sync point. However, it is quite possible that, before reaching the sync point, the slot corresponding to the apply worker has skipped prepares because 2PC was not enabled, and afterward the prepare would be skipped because by then start_decoding_at might have moved. See the explanation in this email: https://www.postgresql.org/message-id/CAA4eK1LuK4t-ZYYCY7k9nMoYP%2Bdwi-JyqUdtcffQMoB_g5k6Hw%40mail.gmail.com. Even the initial_consistent_point won't help, because for the apply worker it will be different from the tablesync slot's initial_consistent_point, and we would have reached initial consistency earlier for the apply worker. To solve the main problem with the above idea (how to detect the prepares that were skipped while the two_phase option was off), we can mark an LSN position in the slot (two_phase_at, which will be the same as the start_decoding_at point when we receive the slot with the 2PC option) where we enable two_phase. If we encounter any commit prepared whose prepare LSN is less than two_phase_at, then we need to send the prepare for the transaction along with the commit prepared. For this solution, on the subscriber side I think we need a tri-state column (two_phase) in pg_subscription. It can have three values: 'disable', 'can_enable', 'enable'. By default, it will be 'disable'.
If the user enables 2PC, we can set it to 'can_enable'. Once we see that all relations are in a READY state, we restart the apply worker, this time sending the two_phase option when starting streaming, and then change the state to 'enable' so that future restarts won't send the option again. On the publisher side, if this option is present, it will change the value of two_phase_at in the slot to start_decoding_at. I think something along these lines should be much easier than the spool-file implementation, unless we see any problem with this idea. -- With Regards, Amit Kapila.
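The two_phase_at rule described above — replay the PREPARE together with its COMMIT PREPARED whenever the prepare predates the point where two-phase decoding was switched on — can be sketched as follows. This is purely illustrative Python, not the actual publisher-side C code; the function and variable names are invented.

```python
# Illustrative sketch (not PostgreSQL code) of the two_phase_at rule:
# a prepare whose LSN is below the point where two-phase decoding was
# enabled must have been skipped at decode time, so it is sent
# downstream together with its COMMIT PREPARED.

def actions_for_commit_prepared(prepare_lsn, two_phase_at):
    """Return the messages the output plugin would need to emit when it
    decodes a COMMIT PREPARED record for a given transaction."""
    if prepare_lsn < two_phase_at:
        # The PREPARE was decoded before two-phase mode was enabled and
        # was therefore never streamed: send it now, just ahead of the
        # COMMIT PREPARED.
        return ["PREPARE", "COMMIT PREPARED"]
    # The PREPARE was already streamed at prepare time.
    return ["COMMIT PREPARED"]

if __name__ == "__main__":
    two_phase_at = 0x5000  # LSN where the subscription turned on two_phase
    assert actions_for_commit_prepared(0x4000, two_phase_at) == \
        ["PREPARE", "COMMIT PREPARED"]
    assert actions_for_commit_prepared(0x6000, two_phase_at) == \
        ["COMMIT PREPARED"]
```

The point of the comparison is that it makes the decision purely local to the slot: no spool file and no retained schema information are needed, only one LSN remembered in the slot.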
Re: Change JOIN tutorial to focus more on explicit joins
On Thu, Mar 11, 2021 at 2:06 AM David Steele wrote: > On 12/1/20 3:38 AM, Jürgen Purtz wrote: > > OK. Patch attached. +Queries which access multiple tables (including repeats) at once are called I'd write "Queries that" here (that's a transatlantic difference in usage; I try to proofread these things in American mode for consistency with the rest of the language in this project, which I probably don't entirely succeed at but this one I've learned...). Maybe instead of "(including repeats)" it could say "(or multiple instances of the same table)"? +For example, to return all the weather records together with the location of the +associated city, the database compares the city column of each row of the weather table with the name column of all rows in the cities table, and select the pairs of rows where these values match. Here "select" should agree with "the database" and take an -s, no? +This syntax pre-dates the JOIN and ON +keywords. The tables are simply listed in the FROM, +comma-separated, and the comparison expression added to the +WHERE clause. Could we mention SQL92 somewhere? Like maybe "This syntax pre-dates the JOIN and ON keywords, which were introduced by SQL-92". (That's a "non-restrictive which", I think the clue is the comma?)
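The two syntaxes being documented here are equivalent in the rows they return; that can be demonstrated quickly with the tutorial's weather/cities tables. The example below uses Python's built-in sqlite3 purely for convenience — the tutorial itself is about PostgreSQL, and the sample data here is a made-up subset of the tutorial's.

```python
# Demonstration that the pre-SQL-92 comma join (condition in WHERE) and
# the explicit JOIN ... ON syntax return the same rows.  sqlite3 is used
# only so the example is self-contained and runnable.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities  (name TEXT, location TEXT);
    CREATE TABLE weather (city TEXT, temp_lo INT);
    INSERT INTO cities  VALUES ('San Francisco', '(-194.0, 53.0)');
    INSERT INTO weather VALUES ('San Francisco', 46), ('Hayward', 37);
""")

# SQL-92 explicit join:
explicit = conn.execute("""
    SELECT w.city, w.temp_lo, c.location
    FROM weather w JOIN cities c ON w.city = c.name
""").fetchall()

# Older syntax: tables comma-separated in FROM, comparison in WHERE:
implicit = conn.execute("""
    SELECT w.city, w.temp_lo, c.location
    FROM weather w, cities c
    WHERE w.city = c.name
""").fetchall()

# Same pairs of matching rows either way; the Hayward row has no
# matching city and is dropped by both (an inner join).
assert explicit == implicit == [('San Francisco', 46, '(-194.0, 53.0)')]
```

This is exactly the behavior the proposed tutorial text describes: the database compares the join columns and keeps only the pairs of rows where the values match.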
Re: Postgres crashes at memcopy() after upgrade to PG 13.
On Sun, Mar 14, 2021 at 6:54 PM Avinash Kumar wrote: > Following may be helpful to understand what I meant. > > I have renamed the table and index names before adding it here. It should be possible to run amcheck on your database, which will detect corrupt posting list tuples on Postgres 13. It's a contrib extension, so you must first run "CREATE EXTENSION amcheck;". From there, you can run a query like the following (you may want to customize this): SELECT bt_index_parent_check(index => c.oid, heapallindexed => true), c.relname, c.relpages FROM pg_index i JOIN pg_opclass op ON i.indclass[0] = op.oid JOIN pg_am am ON op.opcmethod = am.oid JOIN pg_class c ON i.indexrelid = c.oid JOIN pg_namespace n ON c.relnamespace = n.oid WHERE am.amname = 'btree' -- Don't check temp tables, which may be from another session: AND c.relpersistence != 't' -- Function may throw an error when this is omitted: AND c.relkind = 'i' AND i.indisready AND i.indisvalid ORDER BY c.relpages DESC; If this query takes too long to complete you may find it useful to add something to limit the indexes check, such as: AND n.nspname = 'public' -- that change to the SQL will make the query just test indexes from the public schema. Do "SET client_min_messages=DEBUG1 " to get a kind of rudimentary progress indicator, if that seems useful to you. The docs have further information on what this bt_index_parent_check function does, should you need it: https://www.postgresql.org/docs/13/amcheck.html -- Peter Geoghegan
Re: New IndexAM API controlling index vacuum strategies
On Thu, Mar 11, 2021 at 8:31 AM Robert Haas wrote: > But even if not, I'm not sure this > helps much with the situation you're concerned about, which involves > non-HOT tuples. Attached is a POC-quality revision of Masahiko's skip_index_vacuum.patch [1]. There is an improved design for skipping index vacuuming (improved over the INDEX_CLEANUP stuff from Postgres 12). I'm particularly interested in your perspective on this refactoring stuff, Robert, because you ran into the same issues after initial commit of the INDEX_CLEANUP reloption feature [2] -- you ran into issues with the "tupgone = true" special case. This is the case where VACUUM considers a tuple dead that was not marked LP_DEAD by pruning, and so needs to be killed in the second heap scan in lazy_vacuum_heap() instead. You'll recall that these issues were fixed by your commit dd695979888 from May 2019. I think that we need to go further than you did in dd695979888 for this -- we ought to get rid of the special case entirely. ISTM that any new code that skips index vacuuming really ought to be structured as a dynamic version of the "VACUUM (INDEX_CLEANUP OFF)" mechanism. Or vice versa. The important thing is to recognize that they're essentially the same thing, and to structure the code such that they become exactly the same mechanism internally. That's not trivial right now. But removing the awful "tupgone = true" special case seems to buy us a lot -- it makes unifying everything relatively straightforward. In particular, it makes it possible to delay the decision to vacuum indexes until the last moment, which seems essential to making index vacuuming optional. And so I have removed the tupgone/XLOG_HEAP2_CLEANUP_INFO crud in the patch -- that's what all of the changes relate to. This results in a net negative line count, which is a nice bonus! 
I've CC'd Noah, because my additions to this revision (of Masahiko's patch) are loosely based on an abandoned 2013 patch from Noah [3] -- Noah didn't care for the "tupgone = true" special case either. I think that it's fair to say that Tom doesn't much care for it either [4], or at least was distressed by its lack of test coverage as of a couple of years ago -- which is a problem that still exists today. Honestly, I'm surprised that somebody else hasn't removed the code in question already, long ago -- what possible argument can be made for it now? This patch makes the "VACUUM (INDEX_CLEANUP OFF)" mechanism no longer get invoked as if it was like the "no indexes on table so do it all in one heap pass" optimization. This seems a lot clearer -- INDEX_CLEANUP OFF isn't able to call lazy_vacuum_page() at all (for the obvious reason), so any similarity between the two cases was always superficial -- skipping index vacuuming should not be confused with doing a one-pass VACUUM/having no indexes at all. The original INDEX_CLEANUP structure (from commits a96c41fe and dd695979) always seemed confusing to me for this reason, FWIW. Note that I've merged multiple existing functions in vacuumlazy.c into one: the patch merges lazy_vacuum_all_indexes() and lazy_vacuum_heap() into a single function named vacuum_indexes_mark_unused() (note also that lazy_vacuum_page() has been renamed to mark_unused_page() to reflect the fact that it is now strictly concerned with making LP_DEAD line pointers LP_UNUSED). The big idea is that there is one choke point that decides whether index vacuuming is needed at all at one point in time, dynamically. vacuum_indexes_mark_unused() decides this for us at the last moment. This can only happen during a VACUUM that has enough memory to fit all TIDs -- otherwise we won't skip anything dynamically. We may in the future add additional criteria for skipping index vacuuming. 
That can now just be added to the beginning of this new vacuum_indexes_mark_unused() function. We may even teach vacuum_indexes_mark_unused() to skip some indexes but not others in a future release, a possibility that was already discussed at length earlier in this thread. This new structure has all the context it needs to do all of these things. I wonder if we can add some kind of emergency anti-wraparound vacuum logic to what I have here, for Postgres 14. Can we come up with logic that has us skip index vacuuming because XID wraparound is on the verge of causing an outage? That seems like a strategically important thing for Postgres, so perhaps we should try to get something like that in. Practically every post mortem blog post involving Postgres also involves anti-wraparound vacuum. One consequence of my approach is that we now call lazy_cleanup_all_indexes(), even when we've skipped index vacuuming itself. We should at least "check-in" with the indexes IMV. To an index AM, this will be indistinguishable from a VACUUM that never had tuples for it to delete, and so never called ambulkdelete() before calling amvacuumcleanup(). This seems logical to me: why should there be any significant behavioral divergence betwee
Re: cleanup temporary files after crash
On Wed, Mar 10, 2021 at 1:31 AM Michael Paquier wrote: > On Tue, Mar 09, 2021 at 02:28:43AM +0100, Tomas Vondra wrote: > > Let's move this patch forward. Based on the responses, I agree the > > default behavior should be to remove the temp files, and I think we > > should have the GUC (on the off chance that someone wants to preserve > > the temporary files for debugging or whatever other reason). > > Thanks for taking care of this. I am having some second-thoughts > about changing this behavior by default, still that's much more useful > this way. +1 for having it on by default. I was also just looking at this patch and came here to say LGTM except for two cosmetic things, below. > > I propose to rename the GUC to remove_temp_files_after_crash, I think > > "remove" is a bit clearer than "cleanup". I've also reworded the sgml > > docs a little bit. > > "remove" sounds fine to me. +1 > > Attached is a patch with those changes. Barring objections, I'll get > > this committed in the next couple days. > > +When set to on, PostgreSQL will > automatically > Nit: using a markup for the "on" value. Maybe should say "which is the default", like other similar things? > +#remove_temp_files_after_crash = on# remove temporary files after > +# # backend crash? > The indentation of the second line is incorrect here (Incorrect number > of spaces in tabs perhaps?), and there is no need for the '#' at the > beginning of the line. Yeah, that's wrong. For some reason that one file uses a tab size of 8, unlike the rest of the tree (I guess because people will read that file in software with the more common setting of 8). If you do :set tabstop=8 in vim, suddenly it all makes sense, but it is revealed that this patch has it wrong, as you said. (Perhaps this file should have some of those special Vim/Emacs control messages so we don't keep getting this wrong?)
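For what it's worth, the editor control lines suggested in that parenthetical could look something like the following. This is purely hypothetical — no such lines are part of the patch, and note that Emacs in particular only honors file-local variables on the first line of a file or in a trailing "Local Variables:" block:

```
# Hypothetical editor hints for postgresql.conf.sample, so that editors
# render the file with the 8-column tabs its layout assumes:
# vim: set tabstop=8 shiftwidth=8 noexpandtab :
# -*- tab-width: 8; indent-tabs-mode: t -*-
```

With something like this in place, the mis-indented second line in the patch would have been visible to the author immediately.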
Re: Postgres crashes at memcopy() after upgrade to PG 13.
Hi, On Sun, Mar 14, 2021 at 10:17 PM Thomas Munro wrote: > [Dropping pgsql-general@ from the CC, because cross-posting triggers > moderation; sorry I didn't notice that on my first reply] > > On Mon, Mar 15, 2021 at 2:05 PM Avinash Kumar > wrote: > > On Sun, Mar 14, 2021 at 10:01 PM Avinash Kumar < > avinash.vallar...@gmail.com> wrote: > >> Also the datatype is bigint > > Ok. Collation changes are the most common cause of index problems > when upgrading OSes, but here we can rule that out if your index is on > bigint. So it seems like this is some other kind of corruption in > your database, or a bug in the deduplication code. > I suspect the same. When i tried to perform a pg_filedump to see the entry of the ID in the index, it was strange that the entry did not exist in the Index. But, the SELECT using an Index only scan was still working okay. I have chosen the start and end page perfectly and there should not be any mistake there. Following may be helpful to understand what I meant. I have renamed the table and index names before adding it here. =# select pg_size_pretty(pg_relation_size('idx_id_mtime')) as size, relpages from pg_class where relname = 'idx_id_mtime'; size | relpages ---+-- 71 MB | 8439 =# select pg_relation_filepath('idx_id_mtime'); pg_relation_filepath -- base/16404/346644309 =# \d+ idx_id_mtime Index "public.idx_id_mtime" Column | Type | Key? | Definition | Storage | Stats target ---+--+--++-+-- sometable_id | bigint | yes | sometable_id | plain | mtime | timestamp with time zone | yes | mtime | plain | btree, for table "public.sometable" $ pg_filedump -R 1 8439 -D bigint,timestamp /flash/berta13/base/16404/346644309 > 12345.txt $ cat 12345.txt | grep -w 70334 --> No Output. We don't see the entry for the ID : 70334 in the output of pg_filedump. *But, the SELECT statement is still using the same Index. 
* =*# EXPLAIN select * from sometable where sometable_id = 70334; QUERY PLAN Index Scan using idx_id_mtime on sometable (cost=0.43..2.45 rows=1 width=869) Index Cond: (sometable_id = 70334) (2 rows) =*# EXPLAIN ANALYZE select * from sometable where sometable_id = 70334; QUERY PLAN -- Index Scan using idx_id_mtime on sometable (cost=0.43..2.45 rows=1 width=869) (actual time=0.166..0.168 rows=1 loops=1) Index Cond: (sometable_id = 70334) Planning Time: 0.154 ms Execution Time: 0.195 ms (4 rows) =*# update sometable set sometable_id = 70334 where sometable_id = 70334; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !?> Now, let us see the next ID. Here, the entry is visible in the output of pg_filedump. $ cat 12345.txt | grep -w 10819 COPY: 10819 2018-03-21 15:16:41.202277 The update still fails with the same error. =*# update sometable set sometable_id = 10819 where sometable_id = 10819; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !?>
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021/03/11 21:29, Masahiro Ikeda wrote: On 2021-03-11 11:52, Fujii Masao wrote: On 2021/03/11 9:38, Masahiro Ikeda wrote: On 2021-03-10 17:08, Fujii Masao wrote: IIUC the stats collector basically exits after the checkpointer and walwriter exit. But there seems to be no guarantee that the stats collector processes all the messages that other processes have sent during the shutdown of the server. Thanks, I understood the above postmaster behavior. PMState manages the status, and after the checkpointer has exited, the postmaster sends the SIGQUIT signal to the stats collector if the shutdown mode is smart or fast. (IIUC, although the postmaster kills the walsender, the archiver and the stats collector at the same time, that's OK because the walsender and the archiver don't send stats to the stats collector now.) But there might be a corner case where we lose stats sent by background processes like the checkpointer before they exit (although this is not implemented yet). For example: 1. the checkpointer sends its stats before it exits; 2. the stats collector receives the signal and breaks out of its loop before processing the stats message from the checkpointer — in this case, step 1's message is lost; 3. the stats collector writes the stats to the stats files and exits. Why don't we recheck that no messages remain just before step 2? (v17-0004-guarantee-to-collect-last-stats-messages.patch) Yes, I was thinking the same. This is the straightforward fix for this issue. The stats collector should process all the outstanding messages when a normal shutdown is requested, as the patch does. On the other hand, if an immediate shutdown or emergency bailout (by postmaster death) is requested, maybe the stats collector should skip that processing and exit immediately. But if we implement that, we would need to teach the stats collector the shutdown type (i.e., normal shutdown or immediate one).
Because currently SIGQUIT is sent to the collector whichever shutdown is requested, the collector cannot distinguish the shutdown type. I'm afraid that change is a bit of overkill for now. BTW, I found that the collector calls pgstat_write_statsfiles() even in the emergency bailout case, before exiting. It's not necessary to save the stats to the file in that case because the subsequent server startup does crash recovery and clears that stats file. So would it be better to make the collector exit immediately without calling pgstat_write_statsfiles() in the emergency bailout case? Probably this should be discussed in another thread because it's a different topic from the feature we're discussing here, though. IIUC, only the stats collector has its own handler for SIGQUIT, while the other background processes have a common handler for it that just calls _exit(2). I thought we should guarantee that, when TerminateChildren(SIGTERM) is invoked, the stats collector does not shut down before the other background processes have shut down. I will start another thread to discuss whether the stats collector should know the shutdown type or not. If it should, it's better to make the stats collector exit as soon as possible when the shutdown type is immediate, and to avoid losing the remaining stats when it's normal. +1 I researched the past discussion related to writing the stats files when immediate shutdown is requested, and found the following thread ([1]), although the discussion stopped on 12/1/2016. IIUC, the thread's consensus is: (1) Killing the stats collector quickly, before it writes the stats file, is needed in some cases, because otherwise it may take a long time until failover happens. Possible reasons are slow disk write speed, big stats files, and so on. (2) The behavior of removing all stats files when the startup process does crash recovery needs to change, because autovacuum uses the stats.
(3) It's OK for the stats collector to exit without calling pgstat_write_statsfiles() if the stats file is written every X minutes (using WAL or another mechanism) and the startup process can restore the stats with slightly lower freshness. (4) We need to find a way to handle (2)'s stats file when it is deleted on a PITR rewind or when a stats collector crash happens. So, I need to ping the thread. But I don't have any idea how to handle (4) yet... [1] https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F5EF25A%40G01JPEXMBYT05 I measured the timing of the above on my Linux laptop using v17-measure-timing.patch. I don't have any strong opinion on handling this case, since this result shows that receiving and processing the messages takes very little time (less than 1 ms), although the stats collector receives the shutdown signal 5 msec (099->104) after the checkpointer process exits. Agreed. ``` 1615421204.556 [checkpointer] DEBUG: received shutdown request signal 1615421208.099 [checkpointer] DEBUG: proc_exit(-1): 0 callbacks to make # exit and send the mess
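The ordering that the v17-0004 fix aims to guarantee — after the shutdown request arrives, drain every outstanding message and only then write the stats file — can be sketched as follows. This is illustrative Python, not the actual pgstat.c code, and the names are invented.

```python
# Minimal sketch (not pgstat.c) of the shutdown ordering discussed
# above: after the shutdown signal arrives, drain any outstanding
# messages before persisting the stats, so a message sent just before a
# sender process exited is not lost.

from collections import deque

def stats_collector_shutdown(inbox, stats):
    """Process every queued message, then 'write' and return the stats.

    inbox -- deque of (counter_name, amount) messages still unread
    stats -- dict of counters accumulated so far
    """
    # The race in the example: the checkpointer sends its final stats
    # (step 1), then the collector gets the shutdown signal (step 2).
    # Breaking out here without draining the queue would lose step 1's
    # message.
    while inbox:
        name, amount = inbox.popleft()
        stats[name] = stats.get(name, 0) + amount
    # Step 3: only now persist the stats file.
    return dict(stats)

if __name__ == "__main__":
    inbox = deque([("checkpoints_timed", 1), ("buffers_checkpoint", 128)])
    final = stats_collector_shutdown(inbox, {"checkpoints_timed": 41})
    assert final == {"checkpoints_timed": 42, "buffers_checkpoint": 128}
```

As discussed above, an immediate-shutdown path would skip both the drain loop and the final write, which is why the collector would need to know the shutdown type.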
Re: About to add WAL write/fsync statistics to pg_stat_wal view
On 2021-03-12 12:39, Fujii Masao wrote: On 2021/03/11 21:29, Masahiro Ikeda wrote: In addition, I rebased the patch for the WAL receiver. (v17-0003-Makes-the-wal-receiver-report-WAL-statistics.patch) Thanks! Will review this later. Thanks a lot! I read through the 0003 patch. Here are some comments on it. With the patch, the walreceiver's stats are counted as wal_write, wal_sync, wal_write_time and wal_sync_time in pg_stat_wal. But shouldn't they be counted in different columns, because WAL IO is different between the walreceiver and other processes like a backend? For example, if open_sync or open_datasync is chosen as wal_sync_method, those other processes use the O_DIRECT flag to open WAL files, but the walreceiver does not. For example, those other processes write WAL data in block units, but the walreceiver does not. So I'm concerned that mixing different WAL IO stats in the same columns would confuse the users. Thoughts? I'd like to hear more opinions about how to expose the walreceiver's stats to users. Thanks, I understood that get_sync_bit() checks the sync flags, and that the write unit differs between generated WAL data and replicated WAL data. (It's an interesting optimization whether to use the kernel cache or not.) OK. Although I agree with separating the stats for the walreceiver, I want to hear opinions from other people too. I didn't change the patch. Please feel free to leave your comments. +int +XLogWriteFile(int fd, const void *buf, size_t nbyte, off_t offset) This common function writes WAL data and measures IO timing. IMO we can refactor the code further by making this function handle the case where pg_pwrite() reports an error. In other words, I think that the function should do what the do-while loop block in XLogWrite() does. Thoughts? OK. I agree. I wonder whether we should change the error checks depending on who calls this function. Currently, only the walreceiver checks (1) errno==0 and doesn't check (2) errno==EINTR. Other processes are the opposite. IIUC, it's appropriate that every process checks both (1) and (2).
Please let me know if my understanding is wrong.

> BTW, currently XLogWrite() increments the I/O timing even when pg_pwrite()
> reports an error. But this is useless. Probably the I/O timing should be
> incremented after the return code of pg_pwrite() is checked, instead?

Yes, I agree. I fixed it. (v18-0003-Makes-the-wal-receiver-report-WAL-statistics.patch)

BTW, thanks for your comment in person that the bgwriter process can generate WAL data. I checked that it generates WAL to take a snapshot via LogStandbySnapshot(). I attached a patch for the bgwriter to send its WAL stats. (v18-0005-send-stats-for-bgwriter-when-shutdown.patch) This patch includes the following changes.

(1) Introduce to pgstat_send_bgwriter() the mechanism of sending the stats only if PGSTAT_STAT_INTERVAL msec has passed, like pgstat_send_wal(), to avoid overloading the stats collector, because "bgwriter_delay" can be set as low as 10 msec.

(2) Remove pgstat_report_wal() and integrate it with pgstat_send_wal(), because the bgwriter sends WalStats.m_wal_records and to avoid overloading (see (1)). IIUC, although the benefit of keeping them separate is to reduce the calculation cost of WalUsageAccumDiff(), the impact is limited.

(3) Make a new signal handler for the bgwriter to force sending the remaining stats during shutdown, because of (1), and remove HandleMainLoopInterrupts() because no process uses it anymore.
Regards, -- Masahiro Ikeda NTT DATA CORPORATIONdiff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 9b2eb0d10b..c7bda9127f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2536,7 +2536,6 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) Size nbytes; Size nleft; int written; - instr_time start; /* OK to write the page(s) */ from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ; @@ -2544,49 +2543,9 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) nleft = nbytes; do { -errno = 0; +written = XLogWriteFile(openLogFile, from, nleft, (off_t) startoffset, + ThisTimeLineID, openLogSegNo, wal_segment_size); -/* Measure I/O timing to write WAL data */ -if (track_wal_io_timing) - INSTR_TIME_SET_CURRENT(start); - -pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE); -written = pg_pwrite(openLogFile, from, nleft, startoffset); -pgstat_report_wait_end(); - -/* - * Increment the I/O timing and the number of times WAL data - * were written out to disk. - */ -if (track_wal_io_timing) -{ - instr_time duration; - - INSTR_TIME_SET_CURRENT(duration); - INSTR_TIME_SUBTRACT(duration, start); - WalStats.m_wal_write_time += INSTR_TIME_GET_MICROSEC(duration); -} - -WalStats.m_wal_write++; - -if (written <= 0) -{ - char xlogfname[MAXFNAMELEN]; - int save_errno; - - if (errno == EINTR) - continue; - - save_errno = errno; - XLogFileName(xlogfn
Re: Collation versioning
On Fri, Nov 6, 2020 at 10:56 AM Thomas Munro wrote: > On Wed, Nov 4, 2020 at 9:11 PM Michael Paquier wrote: > > On Wed, Nov 04, 2020 at 08:44:15AM +0100, Juan José Santamaría Flecha wrote: > > > We could create a static table with the conversion based on what was > > > discussed for commit a169155, please find attached a spreadsheet with the > > > comparison. This would require maintenance as new LCIDs are released [1]. > > I am honestly not a fan of something like that as it has good chances > > to rot. > No opinion on that, other than that we'd surely want a machine > readable version. As for *when* we use that information, I'm > wondering if it would make sense to convert datcollate to a language > tag in initdb, and also change pg_upgrade's equivalent_locale() > function to consider "English_United States.*" and "en-US" to be > equivalent when upgrading to 14 (which would then be the only point > you'd ever have to have faith that we can convert the old style names > to the new names correctly). I'm unlikely to work on this myself as I > have other operating systems to fix, but I'll certainly be happy if > somehow we can get versioning for default on Windows in PG14 and not > have to come up with weasel words in the manual. FYI I have added this as an open item for PostgreSQL 14. My default action will be to document this limitation, if we can't come up with something better in time.
Re: Postgres crashes at memcopy() after upgrade to PG 13.
[Dropping pgsql-general@ from the CC, because cross-posting triggers moderation; sorry I didn't notice that on my first reply] On Mon, Mar 15, 2021 at 2:05 PM Avinash Kumar wrote: > On Sun, Mar 14, 2021 at 10:01 PM Avinash Kumar > wrote: >> Also the datatype is bigint Ok. Collation changes are the most common cause of index problems when upgrading OSes, but here we can rule that out if your index is on bigint. So it seems like this is some other kind of corruption in your database, or a bug in the deduplication code.
Re: Different compression methods for FPI
@cfbot: Resending without duplicate patches >From 581884bb12f666a1eb6a7f4bd769b1edeba4c563 Mon Sep 17 00:00:00 2001 From: Andrey Borodin Date: Sat, 27 Feb 2021 09:03:50 +0500 Subject: [PATCH 01/10] Allow alternate compression methods for wal_compression TODO: bump XLOG_PAGE_MAGIC --- doc/src/sgml/config.sgml | 17 + src/backend/Makefile | 2 +- src/backend/access/transam/xlog.c | 10 +++ src/backend/access/transam/xloginsert.c | 52 +-- src/backend/access/transam/xlogreader.c | 63 ++- src/backend/utils/misc/guc.c | 11 src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/access/xlog_internal.h| 8 +++ src/include/access/xlogreader.h | 1 + src/include/access/xlogrecord.h | 9 +-- 11 files changed, 163 insertions(+), 12 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index a218d78bef..7fb2a84626 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3072,6 +3072,23 @@ include_dir 'conf.d' + + wal_compressionion_method (enum) + + wal_compression_method configuration parameter + + + + +This parameter selects the compression method used to compress WAL when +wal_compression is enabled. +The supported methods are pglz and zlib. +The default value is pglz. +Only superusers can change this setting. 
+ + + + wal_init_zero (boolean) diff --git a/src/backend/Makefile b/src/backend/Makefile index 0da848b1fd..3af216ddfc 100644 --- a/src/backend/Makefile +++ b/src/backend/Makefile @@ -48,7 +48,7 @@ OBJS = \ LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS)) $(LDAP_LIBS_BE) $(ICU_LIBS) # The backend doesn't need everything that's in LIBS, however -LIBS := $(filter-out -lz -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS)) +LIBS := $(filter-out -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS)) ifeq ($(with_systemd),yes) LIBS += -lsystemd diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f4d1ce5dea..15da91a8dd 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -99,6 +99,7 @@ bool EnableHotStandby = false; bool fullPageWrites = true; bool wal_log_hints = false; bool wal_compression = false; +int wal_compression_method = WAL_COMPRESSION_PGLZ; char *wal_consistency_checking_string = NULL; bool *wal_consistency_checking = NULL; bool wal_init_zero = true; @@ -180,6 +181,15 @@ const struct config_enum_entry recovery_target_action_options[] = { {NULL, 0, false} }; +/* Note that due to conditional compilation, offsets within the array are not static */ +const struct config_enum_entry wal_compression_options[] = { + {"pglz", WAL_COMPRESSION_PGLZ, false}, +#ifdef HAVE_LIBZ + {"zlib", WAL_COMPRESSION_ZLIB, false}, +#endif + {NULL, 0, false} +}; + /* * Statistics for current checkpoint are collected in this global struct. 
* Because only the checkpointer or a stand-alone backend can perform diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index 7052dc245e..34e1227381 100644 --- a/src/backend/access/transam/xloginsert.c +++ b/src/backend/access/transam/xloginsert.c @@ -33,6 +33,10 @@ #include "storage/proc.h" #include "utils/memutils.h" +#ifdef HAVE_LIBZ +#include +#endif + /* Buffer size required to store a compressed version of backup block image */ #define PGLZ_MAX_BLCKSZ PGLZ_MAX_OUTPUT(BLCKSZ) @@ -113,7 +117,8 @@ static XLogRecData *XLogRecordAssemble(RmgrId rmid, uint8 info, XLogRecPtr RedoRecPtr, bool doPageWrites, XLogRecPtr *fpw_lsn, int *num_fpi); static bool XLogCompressBackupBlock(char *page, uint16 hole_offset, - uint16 hole_length, char *dest, uint16 *dlen); + uint16 hole_length, char *dest, + uint16 *dlen, WalCompression compression); /* * Begin constructing a WAL record. This must be called before the @@ -630,11 +635,12 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, */ if (wal_compression) { +bimg.compression_method = wal_compression_method; is_compressed = XLogCompressBackupBlock(page, bimg.hole_offset, cbimg.hole_length, regbuf->compressed_page, - &compressed_len); + &compressed_len, bimg.compression_method); } /* @@ -827,7 +833,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info, */ static bool XLogCompressBackupBlock(char *page, uint16 hole_offset, uint16 hole_length, - char *dest, uint16 *dlen) + char *dest, uint16 *dlen, WalCompression compression) { int32 orig_len = BLCKSZ - hole_length; int32 len; @@ -853,12 +859,48 @@ XLogCompressBackupBlock(char *page, ui
Re: REINDEX backend filtering
Hi Mark,

On Sun, Mar 14, 2021 at 05:01:20PM -0700, Mark Dilger wrote:
>
>> On Mar 14, 2021, at 12:10 AM, Julien Rouhaud wrote:
>
> I'm coming to this patch quite late, perhaps too late to change design
> decisions in time for version 14.

Thanks for looking at it!

> + if (outdated && PQserverVersion(conn) < 14)
> + {
> +     PQfinish(conn);
> +     pg_log_error("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
> +                  "outdated", "14");
> +     exit(1);
> + }
>
> If detection of outdated indexes were performed entirely in the frontend
> (reindexdb) rather than in the backend (reindex command), would reindexdb
> be able to connect to older servers? Looking quickly at the catalogs, it
> appears pg_index, pg_depend, pg_collation and a call to the SQL function
> pg_collation_actual_version() compared against pg_depend.refobjversion
> would be enough to construct a list of indexes in need of reindexing.
> Am I missing something here?

There won't be any need to connect to older servers if the patch is committed in this commitfest, as refobjversion was also added in pg14.

> I understand that wouldn't help somebody wanting to reindex from psql.
> Is that the whole reason you went a different direction with this feature?

This was already discussed with Magnus and Michael. The main reasons for that are:
- no need for a whole new infrastructure to process a list of indexes in parallel, which would be required if the list of indexes were built in the client
- if done in the backend, the ability is immediately available for all user scripts, compared to the burden of writing the needed query (with the usual caveats like quoting, qualifying all objects if the search_path isn't safe and such) and looping through all the results.

> + printf(_(" --outdated only process indexes having outdated depencies\n"));
>
> typo.
>
> + bool outdated; /* depends on at least on deprected collation? */
>
> typo.

Thanks! I'll fix those.
Re: REINDEX backend filtering
On Mon, Mar 15, 2021 at 08:56:00AM +0900, Michael Paquier wrote:
> On Sun, Mar 14, 2021 at 10:57:37PM +0800, Julien Rouhaud wrote:
>>
>> Is there really a use case for reindexing a specific index and at the
>> same time asking for possibly ignoring it? I think we should just forbid
>> REINDEX (OUTDATED) INDEX. What do you think?
>
> I think that there would be cases to be able to handle that, say if a
> user wants to work on a specific set of indexes one-by-one.

If a user wants to work on a specific set of indexes one at a time, then the list of indexes is probably already retrieved from some SQL query, and all the necessary infrastructure to properly filter out the non-outdated indexes already exists.

> There is also the argument of inconsistency with the other commands.

Yes, but the other commands dynamically construct a list of indexes. The only use case I see would be to process a partitioned index when some of the underlying indexes have already been processed. IMO this is better addressed by REINDEX TABLE. Anyway, I'll make REINDEX (OUTDATED) INDEX possibly reindex the explicitly named index, since you think it's a better behavior.

>> I was thinking that users would be more interested in the list of
>> indexes being processed rather than the full list of indexes and a
>> mention of whether each one was processed or not. I can change that if
>> you prefer.
>
> How invasive do you think it would be to add a note in the verbose
> output when indexes are skipped?

Probably not too invasive, but the verbose output is already inconsistent:

# reindex (verbose) table tt;
NOTICE: 0: table "tt" has no indexes to reindex

But a REINDEX (VERBOSE) DATABASE won't emit such a message. I'm assuming that's because it doesn't make sense to warn in that case, as the user didn't explicitly specify the table name. We have the same behavior for now when using the OUTDATED option if no indexes are processed. Should that be changed too?
>> Did you mean index_has_outdated_collation() and
>> index_has_outdated_dependency()? It's just to keep things separated,
>> mostly for future improvements on that infrastructure. I can get rid of
>> that function and put back the code in index_has_outdated_dependency()
>> if that's overkill.
>
> Yes, sorry. I meant index_has_outdated_collation() and
> index_has_outdated_dependency().

And are you ok with this function?
Re: Different compression methods for FPI
On Sat, Mar 13, 2021 at 08:48:33PM +0500, Andrey Borodin wrote:
>> On 13 March 2021, at 06:28, Justin Pryzby wrote:
>> Updated patch with a minor fix to configure.ac to avoid warnings on OSX.
>> And secondary patches from another thread to allow passing recovery
>> tests. Renamed to WAL_COMPRESSION_*. Split LZ4 support into a separate
>> patch and added zstd support. These show that the changes needed for a
>> new compression method have been minimized, although not yet moved to a
>> separate, abstracted compression/decompression function.
>
> Thanks! Awesome work!
>
>> These two patches are a prerequisite for this patch to progress:
>> * Run 011_crash_recovery.pl with wal_level=minimal
>> * Make sure published XIDs are persistent
>>
>> I don't know if anyone will consider this patch for v14 - if not, it
>> should be set to v15 and revisited in a month.
>
> I want to note that the fixes for 011_crash_recovery.pl are not strictly
> necessary for this patch set. The problem in the tests arises if we turn
> on wal_compression, entirely independently of the WAL compression method.
> We turn on wal_compression in this test only for CI purposes.

I rearranged the patches to reflect this. Other changes:
- zlib and zstd now default to level=1;
- added support for negative "zstd fast" levels;
- use correct length accounting for the "hole" in LZ4 and zstd;
- fixed Solution.pm for zstd on Windows;
- switched to zstd by default (for CI);
- added docs.

It occurred to me that the generic "compression API" might be of more significance if we supported compression of all WAL (not just FPIs). I imagine that might use streaming compression.
-- Justin >From 9288683238a72ae379d1b444e497285de5c8c28a Mon Sep 17 00:00:00 2001 From: Andrey Borodin Date: Sat, 27 Feb 2021 09:03:50 +0500 Subject: [PATCH 01/10] Allow alternate compression methods for wal_compression TODO: bump XLOG_PAGE_MAGIC --- doc/src/sgml/config.sgml | 17 + src/backend/Makefile | 2 +- src/backend/access/transam/xlog.c | 10 +++ src/backend/access/transam/xloginsert.c | 52 +-- src/backend/access/transam/xlogreader.c | 63 ++- src/backend/utils/misc/guc.c | 11 src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/access/xlog.h | 1 + src/include/access/xlog_internal.h| 8 +++ src/include/access/xlogreader.h | 1 + src/include/access/xlogrecord.h | 9 +-- 11 files changed, 163 insertions(+), 12 deletions(-) diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index a218d78bef..7fb2a84626 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -3072,6 +3072,23 @@ include_dir 'conf.d' + + wal_compressionion_method (enum) + + wal_compression_method configuration parameter + + + + +This parameter selects the compression method used to compress WAL when +wal_compression is enabled. +The supported methods are pglz and zlib. +The default value is pglz. +Only superusers can change this setting. 
+ + + + wal_init_zero (boolean) diff --git a/src/backend/Makefile b/src/backend/Makefile index 0da848b1fd..3af216ddfc 100644 --- a/src/backend/Makefile +++ b/src/backend/Makefile @@ -48,7 +48,7 @@ OBJS = \ LIBS := $(filter-out -lpgport -lpgcommon, $(LIBS)) $(LDAP_LIBS_BE) $(ICU_LIBS) # The backend doesn't need everything that's in LIBS, however -LIBS := $(filter-out -lz -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS)) +LIBS := $(filter-out -lreadline -ledit -ltermcap -lncurses -lcurses, $(LIBS)) ifeq ($(with_systemd),yes) LIBS += -lsystemd diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index e04250f4e9..04192b7add 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -99,6 +99,7 @@ bool EnableHotStandby = false; bool fullPageWrites = true; bool wal_log_hints = false; bool wal_compression = false; +int wal_compression_method = WAL_COMPRESSION_PGLZ; char *wal_consistency_checking_string = NULL; bool *wal_consistency_checking = NULL; bool wal_init_zero = true; @@ -180,6 +181,15 @@ const struct config_enum_entry recovery_target_action_options[] = { {NULL, 0, false} }; +/* Note that due to conditional compilation, offsets within the array are not static */ +const struct config_enum_entry wal_compression_options[] = { + {"pglz", WAL_COMPRESSION_PGLZ, false}, +#ifdef HAVE_LIBZ + {"zlib", WAL_COMPRESSION_ZLIB, false}, +#endif + {NULL, 0, false} +}; + /* * Statistics for current checkpoint are collected in this global struct. * Because only the checkpointer or a stand-alone backend can perform diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c index 7052dc245e..34e1227381 100644 --
Postgres crashes at memcopy() after upgrade to PG 13.
Hi,

In one of the environments, PostgreSQL 12 was upgraded to PostgreSQL 13.1 using pg_upgrade with hard links. The OS was Ubuntu 16.04.7 LTS (Xenial Xerus). pg_repack was used to rebuild all the tables across the database right after the upgrade to PG 13. A new server with Ubuntu 20.04.1 LTS was later provisioned, and streaming replication was set up from the old server running Ubuntu 16 to the new server on Ubuntu 20 - same PG version, 13.1.

Replication was running fine, but after the failover to the new server, an UPDATE of a few random rows (not on the same page) caused a segmentation fault and crashed Postgres. Selecting the records using the index or directly from the table works absolutely fine, but when the same records are updated, it runs into the following error.

2021-03-12 17:20:01.979 CET p#7 s#604b8fa9.7 t#0 LOG: terminating any other active server processes
2021-03-12 17:20:01.979 CET p#41 s#604b9212.29 t#0 WARNING: terminating connection because of crash of another server process
2021-03-12 17:20:01.979 CET p#41 s#604b9212.29 t#0 DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2021-03-12 17:20:01.979 CET p#41 s#604b9212.29 t#0 HINT: In a moment you should be able to reconnect to the database and repeat your command.

The gdb backtrace looks like the following with debug symbols.
(gdb) bt #0 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:533 #1 0x55b72761c370 in memmove (__len=, __src=0x55b72930e9c7, __dest=) at /usr/include/x86_64-linux-gnu/bits/string_fortified.h:40 #2 _bt_swap_posting (newitem=newitem@entry=0x55b7292010c0, oposting=oposting@entry=0x7f3b46f94778, postingoff=postingoff@entry=2) at ./build/../src/backend/access/nbtree/nbtdedup.c:796 #3 0x55b72761d40b in _bt_insertonpg (rel=0x7f3acd8a49c0, itup_key=0x55b7292bc6a8, buf=507, cbuf=0, stack=0x55b7292d5f98, itup=0x55b7292010c0, itemsz=32, newitemoff=48, postingoff=2, split_only_page=false) at ./build/../src/backend/access/nbtree/nbtinsert.c:1167 #4 0x55b72761eae9 in _bt_doinsert (rel=rel@entry=0x7f3acd8a49c0, itup=itup@entry=0x55b7292bc848, checkUnique=checkUnique@entry=UNIQUE_CHECK_NO, heapRel=heapRel@entry =0x7f3acd894f70) at ./build/../src/backend/access/nbtree/nbtinsert.c:1009 #5 0x55b727621e2e in btinsert (rel=0x7f3acd8a49c0, values=, isnull=, ht_ctid=0x55b7292d4578, heapRel=0x7f3acd894f70, checkUnique=UNIQUE_CHECK_NO, indexInfo=0x55b7292bc238) at ./build/../src/backend/access/nbtree/nbtree.c:210 #6 0x55b727757487 in ExecInsertIndexTuples (slot=slot@entry=0x55b7292d4548, estate=estate@entry=0x55b7291ff1f8, noDupErr=noDupErr@entry=false, specConflict=specConflict@entry=0x0, arbiterIndexes=arbiterIndexes@entry=0x0) at ./build/../src/backend/executor/execIndexing.c:393 #7 0x55b7277807a8 in ExecUpdate (mtstate=0x55b7292bb2c8, tupleid=0x7fff45ea318a, oldtuple=0x0, slot=0x55b7292d4548, planSlot=0x55b7292c04e8, epqstate=0x55b7292bb3c0, estate=0x55b7291ff1f8, canSetTag=true) at ./build/../src/backend/executor/nodeModifyTable.c:1479 #8 0x55b727781655 in ExecModifyTable (pstate=0x55b7292bb2c8) at ./build/../src/backend/executor/nodeModifyTable.c:2253 #9 0x55b727758424 in ExecProcNode (node=0x55b7292bb2c8) at ./build/../src/include/executor/executor.h:248 #10 ExecutePlan (execute_once=, dest=0x55b7292c1728, direction=, numberTuples=0, 
sendTuples=, operation=CMD_UPDATE, use_parallel_mode=, planstate=0x55b7292bb2c8, estate=0x55b7291ff1f8) at ./build/../src/backend/executor/execMain.c:1632 #11 standard_ExecutorRun (queryDesc=0x55b7292ba578, direction=, count=0, execute_once=) at ./build/../src/backend/executor/execMain.c:350 #12 0x55b7278bebf7 in ProcessQuery (plan=, sourceText=0x55b72919efa8 "\031)\267U", params=0x0, queryEnv=0x0, dest=0x55b7292c1728, qc=0x7fff45ea34c0) at ./build/../src/backend/tcop/pquery.c:160 #13 0x55b7278bedf9 in PortalRunMulti (portal=portal@entry=0x55b729254128, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x55b7292c1728, altdest=altdest@entry=0x55b7292c1728, qc=qc@entry=0x7fff45ea34c0) at ./build/../src/backend/tcop/pquery.c:1265 #14 0x55b7278bf847 in PortalRun (portal=portal@entry=0x55b729254128, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x55b7292c1728, --Type for more, q to quit, c to continue without paging-- Is this expected when replication is happening between PostgreSQL databases hosted on different OS versions like Ubuntu 16 and Ubuntu 20 ? Or, do we think this is some sort of corruption ? -- Regards, Avi.
Re: A qsort template
On Sun, Mar 14, 2021 at 5:03 PM Zhihong Yu wrote:
> For 0001-Add-bsearch-and-unique-templates-to-sort_template.h.patch:
>
> + * Remove duplicates from an array. Return the new size.
> + */
> +ST_SCOPE size_t
> +ST_UNIQUE(ST_ELEMENT_TYPE *array,
>
> The array is supposed to be sorted, right?
> The comment should mention this.

Good point, will update. Thanks!
Re: REINDEX backend filtering
> On Mar 14, 2021, at 12:10 AM, Julien Rouhaud wrote:
>
> v6 attached, rebase only due to conflict with recent commit.

Hi Julien,

I'm coming to this patch quite late, perhaps too late to change design decisions in time for version 14.

+ if (outdated && PQserverVersion(conn) < 14)
+ {
+     PQfinish(conn);
+     pg_log_error("cannot use the \"%s\" option on server versions older than PostgreSQL %s",
+                  "outdated", "14");
+     exit(1);
+ }

If detection of outdated indexes were performed entirely in the frontend (reindexdb) rather than in the backend (reindex command), would reindexdb be able to connect to older servers? Looking quickly at the catalogs, it appears pg_index, pg_depend, pg_collation and a call to the SQL function pg_collation_actual_version() compared against pg_depend.refobjversion would be enough to construct a list of indexes in need of reindexing. Am I missing something here?

I understand that wouldn't help somebody wanting to reindex from psql. Is that the whole reason you went a different direction with this feature?

+ printf(_(" --outdated only process indexes having outdated depencies\n"));

typo.

+ bool outdated; /* depends on at least on deprected collation? */

typo.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: REINDEX backend filtering
On Sun, Mar 14, 2021 at 10:57:37PM +0800, Julien Rouhaud wrote:
> On Sun, Mar 14, 2021 at 08:54:11PM +0900, Michael Paquier wrote:
>> In ReindexRelationConcurrently(), there is no filtering done for the
>> indexes themselves. The operation is only done on the list of indexes
>> fetched from the parent relation. Why? This means that a REINDEX
>> (OUTDATED) INDEX would actually rebuild an index even if this index
>> has no out-of-date collations, like a catalog. I think that's
>> confusing.
>>
>> Same comment for the non-concurrent case, as of the code paths of
>> reindex_index().
>
> Yes, I'm not sure what we should do in that case. I thought I put a
> comment about that but it apparently disappeared during some rewrite.
>
> Is there really a use case for reindexing a specific index and at the
> same time asking for possibly ignoring it? I think we should just forbid
> REINDEX (OUTDATED) INDEX. What do you think?

I think that there would be cases where being able to handle that is useful, say if a user wants to work on a specific set of indexes one-by-one. There is also the argument of inconsistency with the other commands.

> I was thinking that users would be more interested in the list of indexes
> being processed rather than the full list of indexes and a mention of
> whether each one was processed or not. I can change that if you prefer.

How invasive do you think it would be to add a note in the verbose output when indexes are skipped?

> Did you mean index_has_outdated_collation() and
> index_has_outdated_dependency()? It's just to keep things separated,
> mostly for future improvements on that infrastructure. I can get rid of
> that function and put back the code in index_has_outdated_dependency()
> if that's overkill.

Yes, sorry. I meant index_has_outdated_collation() and index_has_outdated_dependency().
--
Michael
Re: fdatasync performance problem with large number of DB files
On Mon, Mar 15, 2021 at 11:52 AM Thomas Munro wrote: > Time being of the essence, here is the patch I posted last year, this > time with a GUC and some docs. You can set sync_after_crash to > "fsync" (default) or "syncfs" if you have it. Cfbot told me to add HAVE_SYNCFS to Solution.pm, and I fixed a couple of typos. From 5b4a97bd99ba0ce789770c1dea48ea5544e175c7 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 27 Sep 2020 17:22:10 +1300 Subject: [PATCH v3] Optionally use syncfs() for SyncDataDirectory() on Linux. Opening every file can be slow, as the first step before crash recovery can begin. Provide an alternative method, where we call syncfs() on every possibly different filesystem under the data directory. This avoids faulting in inodes for potentially very many inodes. The option can be controlled with a new GUC: sync_after_crash={fsync,syncfs} Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com --- configure | 2 +- configure.ac | 1 + doc/src/sgml/config.sgml | 28 + src/backend/storage/file/fd.c | 62 ++- src/backend/utils/misc/guc.c | 17 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/pg_config.h.in| 3 + src/include/storage/fd.h | 8 +++ src/tools/msvc/Solution.pm| 1 + 9 files changed, 121 insertions(+), 2 deletions(-) diff --git a/configure b/configure index 3fd4cecbeb..6a2051da9d 100755 --- a/configure +++ b/configure @@ -15239,7 +15239,7 @@ fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` -for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev +for ac_func in backtrace_symbols clock_gettime 
copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.ac b/configure.ac index 2f1585adc0..dd819ab2c2 100644 --- a/configure.ac +++ b/configure.ac @@ -1681,6 +1681,7 @@ AC_CHECK_FUNCS(m4_normalize([ strchrnul strsignal symlink + syncfs sync_file_range uselocale wcstombs_l diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index a218d78bef..cc3843ac81 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -9679,6 +9679,34 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' + + sync_after_crash (enum) + +sync_after_crash configuration parameter + + + + +When set to fsync, which is the default, +PostgreSQL will recursively open and fsync +all files in the data directory before crash recovery begins. This +is intended to make sure that all WAL and data files are durably stored +on disk before replaying changes. + + +On Linux, syncfs may be used instead, to ask the +operating system to flush the whole file system. This may be a lot +faster, because it doesn't need to open each file one by one. On the +other hand, it may be slower if the file system is shared by other +applications that modify a lot of files, since those files will also +be flushed to disk. Furthermore, on versions of Linux before 5.8, I/O +errors encountered while writing data to disk may not be reported to +PostgreSQL, and relevant error messages may +appear only in kernel logs. 
+ + + + diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c index b58502837a..461a1a5b07 100644 --- a/src/backend/storage/file/fd.c +++ b/src/backend/storage/file/fd.c @@ -72,9 +72,11 @@ #include "postgres.h" +#include #include #include #include +#include #ifndef WIN32 #include #endif @@ -158,6 +160,9 @@ int max_safe_fds = FD_MINFREE; /* default if not changed */ /* Whether it is safe to continue running after fsync() fails. */ bool data_sync_retry = false; +/* How SyncDataDirectory() should do its job. */ +int sync_after_crash = SYNC_AFTER_CRASH_FSYNC; + /* Debugging */ #ifdef FDDEBUG @@ -3263,9 +3268,27 @@ looks_like_temp_rel_name(const char *name) return true; } +#ifde
Re: Boundary value check in lazy_tid_reaped()
On Wed, Sep 9, 2020 at 7:33 AM Peter Geoghegan wrote:
> On Tue, Sep 8, 2020 at 1:23 AM Masahiko Sawada wrote:
>>>> I wonder if you would also see a speed-up with a bsearch() replacement
>>>> that is inlineable, so it can inline the comparator (instead of
>>>> calling it through a function pointer). I wonder if something more
>>>> like (lblk << 32 | loff) - (rblk << 32 | roff) would go faster than
>>>> the branchy comparator.
>>>
>>> Erm, of course that expression won't work... it should be << 16, but
>>> even then it would only work with a bsearch that uses int64_t
>>> comparators, so I take that part back.
>>
>> Yeah, it seems worth benchmarking the speed-up with inlining. I'll do
>> some performance tests with/without inlining on top of checking
>> boundary values.
>
> It sounds like Thomas was talking about something like
> itemptr_encode() + itemptr_decode(). In case you didn't know, we
> actually do something like this for the TID tuplesort used for CREATE
> INDEX CONCURRENTLY.

BTW I got around to trying this idea out for a specialised bsearch_itemptr() using a wide comparator, over here:

https://www.postgresql.org/message-id/CA%2BhUKGKztHEWm676csTFjYzortziWmOcf8HDss2Zr0muZ2xfEg%40mail.gmail.com
Re: fdatasync performance problem with large number of DB files
On Thu, Mar 11, 2021 at 2:32 PM Thomas Munro wrote: > On Thu, Mar 11, 2021 at 2:25 PM Tom Lane wrote: > > Trolling the net, I found a newer-looking version of the man page, > > and behold it says > > > >In mainline kernel versions prior to 5.8, syncfs() will fail only > >when passed a bad file descriptor (EBADF). Since Linux 5.8, > >syncfs() will also report an error if one or more inodes failed > >to be written back since the last syncfs() call. > > > > So this means that in less-than-bleeding-edge kernels, syncfs can > > only be regarded as a dangerous toy. If we expose an option to use > > it, there had better be large blinking warnings in the docs. > > Agreed. Perhaps we could also try to do something programmatic about that. Time being of the essence, here is the patch I posted last year, this time with a GUC and some docs. You can set sync_after_crash to "fsync" (default) or "syncfs" if you have it. I would plan to extend that to include a third option as already discussed in the other thread, maybe something like "wal" (= sync WAL files and then do extra analysis of WAL data to sync only data modified since checkpoint but not replayed), but that'd be material for PG15. From e1a14240a08b802b669c7a4d2a7cacc1004a7de4 Mon Sep 17 00:00:00 2001 From: Thomas Munro Date: Sun, 27 Sep 2020 17:22:10 +1300 Subject: [PATCH v2] Optionally use syncfs() for SyncDataDirectory() on Linux. Opening every file can be slow, as the first step before crash recovery can begin. Provide an alternative method, where we call syncfs() on every possibly different filesystem under the data directory. This avoids faulting in inodes for potentially very many inodes. 
The option can be controlled with a new GUC: sync_after_crash={fsync,syncfs} Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com --- configure | 2 +- configure.ac | 1 + doc/src/sgml/config.sgml | 28 + src/backend/storage/file/fd.c | 62 ++- src/backend/utils/misc/guc.c | 17 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/include/pg_config.h.in| 3 + src/include/storage/fd.h | 8 +++ 8 files changed, 120 insertions(+), 2 deletions(-) diff --git a/configure b/configure index 3fd4cecbeb..6a2051da9d 100755 --- a/configure +++ b/configure @@ -15239,7 +15239,7 @@ fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` -for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink sync_file_range uselocale wcstombs_l writev +for ac_func in backtrace_symbols clock_gettime copyfile fdatasync getifaddrs getpeerucred getrlimit kqueue mbstowcs_l memset_s poll posix_fallocate ppoll pstat pthread_is_threaded_np readlink readv setproctitle setproctitle_fast setsid shm_open strchrnul strsignal symlink syncfs sync_file_range uselocale wcstombs_l writev do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" diff --git a/configure.ac b/configure.ac index 2f1585adc0..dd819ab2c2 100644 --- a/configure.ac +++ b/configure.ac @@ -1681,6 +1681,7 @@ AC_CHECK_FUNCS(m4_normalize([ strchrnul strsignal symlink + syncfs sync_file_range uselocale wcstombs_l diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index a218d78bef..f8bd82a729 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -9679,6 +9679,34 @@ 
dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir' + + sync_after_crash (enum) + +sync_after_crash configuration parameter + + + + +When set to fsync, which is the default, +PostgreSQL will recursively open and fsync +all files in the data directory before crash recovery begins. This +is intended to make sure that all WAL and data files are durably stored +on disk before replaying changes. + + +On Linux, syncfs may be used instead, to ask the +operating system to flush the whole file system. This may be a lot +faster, because it doesn't need to open each file one by one. On the +other hand, it may be slower if the file system is shared by other +applications that modify a lot of files, since those files will also +be flushed to disk. Furthermore, on versions of Linux before 5.8, I/O +errors encountered while writing data to disk may not be reported to
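For reference, the syncfs() call at the heart of this patch can be exercised with a minimal standalone sketch (Linux-specific; glibc exposes syncfs() under _GNU_SOURCE; the function name here is illustrative, not the patch's code):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Ask the kernel to flush the whole filesystem containing `path`,
 * instead of opening and fsyncing every file under it one by one.
 * Caveat from the discussion above: before Linux 5.8, syncfs() does
 * not report writeback errors, so callers cannot rely on it to detect
 * failed I/O on older kernels. */
static int
sync_filesystem_of(const char *path)
{
	int			fd = open(path, O_RDONLY);
	int			rc;

	if (fd < 0)
		return -1;
	rc = syncfs(fd);
	close(fd);
	return rc;
}
```

This is the trade-off the docs describe: one cheap system call per filesystem rather than one open+fsync per file, at the cost of also flushing unrelated dirty files on a shared filesystem.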
A Case For Inlining Immediate Referential Integrity Checks
A Case For Inlining Immediate Referential Integrity Checks
--
The following is an overview of how Postgres currently implements referential integrity, some problems with that architecture, attempted solutions for those problems, and a suggestion of another possible solution.

Notes On Notation and Referential Integrity In General
--
All referential integrity is ultimately of this form:

    R(X) => T(Y)

where one referencing table R has a set of columns X that references a set of columns Y which comprise a unique constraint on a target table T. Note that the Y-set of columns is usually the primary key of T, but does not have to be.

The basic referential integrity checks fall into two categories, Insert and Update, which can be checked Immediately following the statement, or can be Deferred to the end of the transaction.

The Insert check is fairly straightforward. Any insert to R, or update of R that modifies [1] any column in X, is checked to see if all of the X columns are NOT NULL, and if so, a lookup is done on T to find a matching tuple of Y. If none is found, then an error is raised.

The Update check is more complicated, as it covers any UPDATE operation that modifies [1] any column in Y where all of the values of Y are NOT NULL, as well as any DELETE operation where all of the columns of Y are NOT NULL. For any Update check, the table R is scanned for any X tuples matching the previous Y values, and for any matches found, an action is taken. That action can be to fail the operation (NO ACTION, RESTRICT), to update the X values to fixed values (SET NULL, SET DEFAULT), or to delete those rows in R (CASCADE).

Current Implementation
--
Currently, these operations are handled via per-row triggers. In the general case, one trigger is placed on R for INSERT operations, one trigger is placed on T for DELETE operations, and an additional trigger is placed on T for UPDATE operations that affect any column of Y.
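The Insert check described above — if all X columns are non-NULL, probe T's unique key — can be modeled with a minimal standalone sketch, using a sorted array to stand in for the unique index on T (hypothetical names, not RI_FKey_check() itself):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static int
int_cmp(const void *a, const void *b)
{
	int			x = *(const int *) a;
	int			y = *(const int *) b;

	return (x > y) - (x < y);
}

/* Model of the Insert check for a single-column key: x == NULL stands
 * for SQL NULL, in which case the constraint passes trivially;
 * otherwise *x must appear among T's unique key values, here a sorted
 * array standing in for the unique index on T(Y). */
static bool
ri_insert_check(const int *x, const int *t_keys, size_t n)
{
	if (x == NULL)
		return true;			/* NULL in X: no check required */
	return bsearch(x, t_keys, n, sizeof(int), int_cmp) != NULL;
}
```

A failed probe corresponds to the foreign-key violation error described above; a successful one corresponds to the FOR KEY SHARE row lock taken on the matching row in T.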
These Insert trigger functions invoke the C function RI_FKey_check() [2]. The trigger is fired unconditionally, and the trigger itself determines whether there is a referential integrity check to be made or not. Ultimately this trigger invokes an SPI query of the form:

    SELECT 1 FROM <T> WHERE (<Y columns> = <X values>) FOR KEY SHARE

This query is generally quite straightforward for the planner, as it becomes either a scan of a single unique index, or a partition search followed by a scan of a single unique index. The operation succeeds if a row is found, and fails if it does not.

The Update trigger functions are implemented with a set of C functions RI_[noaction|restrict|cascade|setnull|setdefault]_[upd|del]() [3]. These functions each generate a variation of SPI query in one of the following forms:

    cascade:           DELETE FROM <R> WHERE <X columns> = <old Y values>
    restrict/noaction: SELECT 1 FROM <R> WHERE <X columns> = <old Y values> FOR KEY SHARE
    setnull:           UPDATE <R> SET x1 = NULL, ... WHERE <X columns> = <old Y values>
    setdefault:        UPDATE <R> SET x1 = DEFAULT, ... WHERE <X columns> = <old Y values>

These triggers are either executed at statement time (Immediate) or are queued for execution as a part of the transaction commit (Deferred).

Problems With The Current Implementation
--
The main problems with this architecture come down to visibility and performance. The foremost problem with this implementation is that these extra queries are not visible to the end user in any way. It is possible to infer that the functions executed by looking at the constraint definitions and comparing pg_stat_user_tables or pg_stat_user_indexes before and after the operation, but in general the time spent in these functions accrues to the DML statement (Immediate) or COMMIT statement (Deferred) without any insight into what took place. This is especially vexing in situations where an operation as simple as "DELETE FROM highly_referenced_table WHERE id = 1" hits the primary key index, but takes several seconds to run.

The performance of Insert operations is generally not too bad, in that the query boils down to an Index Scan for a single row.
The problem, however, is that this query must be executed for every row inserted. The query itself is only planned once, and that query plan is cached for later re-use. That removes some of the per-query overhead, but also incurs a growing cache of plans which can create memory pressure if the number of foreign keys is large, and indeed this has become a problem for at least one customer [4]. Some profiling of the RI check indicated that about half of the time of the insert was spent in SPI functions that could be bypassed if the C function called index_beginscan and index_rescan directly [5]. These indications were borne out when Amit Langote wrote a patch [6] which finds the designated index from the constraint (with some drilling through partitions if need be) and then invokes the scan functions. This method showed about a halving of the t
Re: pl/pgsql feature request: shorthand for argument and local variable references
Hi

On Sat, Mar 13, 2021 at 12:48 PM, Julien Rouhaud wrote: > On Sat, Mar 13, 2021 at 12:10:29PM +0100, Pavel Stehule wrote: > > > > On Sat, Mar 13, 2021 at 9:53 AM, Julien Rouhaud wrote: > > > > > > I don't think that it makes sense to have multiple occurences of this > > > command, > > > and we should simply error out if > plpgsql_curr_compile->root_ns->itemno is > > > PLPGSQL_LABEL_REPLACED. If any case this should also be covered in the > > > regression tests. > > > > > > > I did it. Thank you for check > > Thanks, LGTM. > > > I wrote few sentences to documentation > > Great! > > I just had a few cosmetic comments: > > + block can be changed by inserting special command at the start of > the function > [...] > + arguments) is possible with option routine_label: > [...] > +errhint("The option \"routine_label\" can > be used only once in rutine."), > [...] > +-- Check root namespace renaming (routine_label option) > > You're sometimes using "command" and "sometimes" option. Should we always > use > the same term, maybe "command" as it's already used for #variable_conflict > documentation? > > Also > > + variables can be qualified with the function's name. The name of > this outer > + block can be changed by inserting special command at the start of > the function > + #routine_label new_name. > > It's missing a preposition before "special command". Maybe > > + variables can be qualified with the function's name. The name of > this outer > + block can be changed by inserting a special command > + #routine_label new_name at the start of the > function. > > > + The function's argument can be qualified by function name: > > Should be "qualified with the function name" > > + Sometimes the function name is too long and can be more practical to > use > + some shorter label. An change of label of top namespace (with > functions's > + arguments) is possible with option routine_label: > > Few similar issues.
How about: > > Sometimes the function name is too long and *it* can be more practical to > use > some shorter label. *The top namespace label can be changed* (*along* with > *the* functions' arguments) *using the option* > routine_label: > > I'm not a native English speaker so the proposes changed may be wrong or > not > enough. > My English is good enough for taking beer everywhere in the world :) . It is not good, but a lot of people who don't understand to English understand my simplified fork of English language. Thank you for the check. Updated patch attached. Pavel diff --git a/doc/src/sgml/plpgsql.sgml b/doc/src/sgml/plpgsql.sgml index 9242c54329..1af56eef86 100644 --- a/doc/src/sgml/plpgsql.sgml +++ b/doc/src/sgml/plpgsql.sgml @@ -292,7 +292,10 @@ $$ LANGUAGE plpgsql; special variables such as FOUND (see ). The outer block is labeled with the function's name, meaning that parameters and special - variables can be qualified with the function's name. + variables can be qualified with the function's name. The name of this outer + block can be changed by inserting a special command + #routine_label new_name at the start of the function. + . @@ -435,6 +438,31 @@ $$ LANGUAGE plpgsql; + + The function's argument can be qualified with the function name: + +CREATE FUNCTION sales_tax(subtotal real) RETURNS real AS $$ +BEGIN +RETURN sales_tax.subtotal * 0.06; +END; +$$ LANGUAGE plpgsql; + + + + + Sometimes the function name is too long and it can be more practical to use + some shorter label.
The top namespace label can be changed (along with + the functions' arguments) using the option routine_label: + +CREATE FUNCTION sales_tax(subtotal real) RETURNS real AS $$ +#routine_label s +BEGIN +RETURN s.subtotal * 0.06; +END; +$$ LANGUAGE plpgsql; + + + Some more examples: diff --git a/src/pl/plpgsql/src/pl_comp.c b/src/pl/plpgsql/src/pl_comp.c index ce8d97447d..d32e050c32 100644 --- a/src/pl/plpgsql/src/pl_comp.c +++ b/src/pl/plpgsql/src/pl_comp.c @@ -378,6 +378,10 @@ do_compile(FunctionCallInfo fcinfo, */ plpgsql_ns_init(); plpgsql_ns_push(NameStr(procStruct->proname), PLPGSQL_LABEL_BLOCK); + + /* save top ns for possibility to alter top label */ + function->root_ns = plpgsql_ns_top(); + plpgsql_DumpExecTree = false; plpgsql_start_datums(); diff --git a/src/pl/plpgsql/src/pl_funcs.c b/src/pl/plpgsql/src/pl_funcs.c index 919b968826..9dc07f9e58 100644 --- a/src/pl/plpgsql/src/pl_funcs.c +++ b/src/pl/plpgsql/src/pl_funcs.c @@ -105,6 +105,40 @@ plpgsql_ns_additem(PLpgSQL_nsitem_type itemtype, int itemno, const char *name) ns_top = nse; } +/* + * Replace ns item of label type by creating new entry and redirect + * old entry to new one. + */ +void +plpgsql_ns_replace_root_label(PLpgSQL_nsitem *nse, + const char *name, + int location) +{ + PLpgSQL_nsitem *new_nse; + +
Re: New IndexAM API controlling index vacuum strategies
On Sat, Mar 13, 2021 at 7:23 PM Peter Geoghegan wrote: > In other words, I am not worried about debt, exactly. Debt is normal > in moderation. Healthy, even. I am worried about bankruptcy, perhaps > following a rare and extreme event. It's okay to be imprecise, but all > of the problems must be survivable. The important thing to me for a > maintenance_work_mem threshold is that there is *some* limit. At the > same time, it may totally be worth accepting 2 or 3 index scans during > some eventual VACUUM operation if there are many more VACUUM > operations that don't even touch the index -- that's a good deal! > Also, it may actually be inherently necessary to accept a small risk > of having a future VACUUM operation that does multiple scans of each > index -- that is probably a necessary part of skipping index vacuuming > each time. > > Think about the cost of index vacuuming (the amount of I/O and the > duration of index vacuuming) as less and less memory is available for > TIDs. It's non-linear. The cost explodes once we're past a certain > point. The truly important thing is to "never get killed by the > explosion". I just remembered this blog post, which gives a nice high level summary of my mental model for things like this: https://jessitron.com/2021/01/18/when-costs-are-nonlinear-keep-it-small/ This patch should eliminate inefficient index vacuuming involving very small "batch sizes" (i.e. a small number of TIDs/index tuples to delete from indexes). At the same time, it should not allow the batch size to get too large because that's also inefficient. Perhaps larger batch sizes are not exactly inefficient -- maybe they're risky. Though risky is actually kind of the same thing as inefficient, at least to me. So IMV what we want to do here is to recognize cases where "batch size" is so small that index vacuuming couldn't possibly be efficient. We don't need to truly understand how that might change over time in each case -- this is relatively easy.
There is some margin for error here, even with this reduced-scope version that just does the SKIP_VACUUM_PAGES_RATIO thing. The patch can afford to make suboptimal decisions about the scheduling of index vacuuming over time (relative to the current approach), provided the additional cost is at least *tolerable* -- that way we are still very likely to win in the aggregate, over time. However, the patch cannot be allowed to create a new risk of significantly worse performance for any one VACUUM operation. -- Peter Geoghegan
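The SKIP_VACUUM_PAGES_RATIO idea can be sketched as a simple threshold test. The signature and the ratio value below are assumptions for illustration, not the patch's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: skip index vacuuming when the fraction of
 * heap pages containing dead item pointers is tiny, i.e. the "batch
 * size" is too small for a full scan of every index to pay off. */
#define SKIP_VACUUM_PAGES_RATIO 0.01

static bool
should_skip_index_vacuum(uint64_t rel_pages, uint64_t pages_with_dead_items)
{
	if (rel_pages == 0)
		return true;			/* empty relation: nothing to do */
	return (double) pages_with_dead_items / rel_pages < SKIP_VACUUM_PAGES_RATIO;
}
```

The point of such a bound is exactly the "keep it small" argument above: each skipped index pass lets debt accumulate, so the ratio must be small enough that a later combined pass stays tolerable rather than explosive.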
Re: Transactions involving multiple postgres foreign servers, take 2
On Thu, Feb 11, 2021 at 6:25 PM Masahiko Sawada wrote: > On Fri, Feb 5, 2021 at 2:45 PM Masahiko Sawada > wrote: > > > > On Tue, Feb 2, 2021 at 5:18 PM Fujii Masao > wrote: > > > > > > > > > > > > On 2021/01/27 14:08, Masahiko Sawada wrote: > > > > On Wed, Jan 27, 2021 at 10:29 AM Fujii Masao > > > > wrote: > > > >> > > > >> > > > >> You fixed some issues. But maybe you forgot to attach the latest > patches? > > > > > > > > Yes, I've attached the updated patches. > > > > > > Thanks for updating the patch! I tried to review 0001 and 0002 as the > self-contained change. > > > > > > + * An FDW that implements both commit and rollback APIs can request > to register > > > + * the foreign transaction by FdwXactRegisterXact() to participate it > to a > > > + * group of distributed tranasction. The registered foreign > transactions are > > > + * identified by OIDs of server and user. > > > > > > I'm afraid that the combination of OIDs of server and user is not > unique. IOW, more than one foreign transactions can have the same > combination of OIDs of server and user. For example, the following two > SELECT queries start the different foreign transactions but their user OID > is the same. OID of user mapping should be used instead of OID of user? > > > > > > CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw; > > > CREATE USER MAPPING FOR postgres SERVER loopback OPTIONS (user > 'postgres'); > > > CREATE USER MAPPING FOR public SERVER loopback OPTIONS (user > 'postgres'); > > > CREATE TABLE t(i int); > > > CREATE FOREIGN TABLE ft(i int) SERVER loopback OPTIONS > (table_name 't'); > > > BEGIN; > > > SELECT * FROM ft; > > > DROP USER MAPPING FOR postgres SERVER loopback ; > > > SELECT * FROM ft; > > > COMMIT; > > > > Good catch. 
I've considered using user mapping OID or a pair of user > > mapping OID and server OID as a key of foreign transactions but I > > think it also has a problem if an FDW caches the connection by pair of > > server OID and user OID whereas the core identifies them by user > > mapping OID. For instance, mysql_fdw manages connections by pair of > > server OID and user OID. > > > > For example, let's consider the following execution: > > > > BEGIN; > > SET ROLE user_A; > > INSERT INTO ft1 VALUES (1); > > SET ROLE user_B; > > INSERT INTO ft1 VALUES (1); > > COMMIT; > > > > Suppose that an FDW identifies the connections by {server OID, user > > OID} and the core GTM identifies the transactions by user mapping OID, > > and user_A and user_B use the public user mapping to connect server_X. > > In the FDW, there are two connections identified by {user_A, sever_X} > > and {user_B, server_X} respectively, and therefore opens two > > transactions on each connection, while GTM has only one FdwXact entry > > because the two connections refer to the same user mapping OID. As a > > result, at the end of the transaction, GTM ends only one foreign > > transaction, leaving another one. > > > > Using user mapping OID seems natural to me but I'm concerned that > > changing role in the middle of transaction is likely to happen than > > dropping the public user mapping but not sure. We would need to find > > more better way. > > After more thought, I'm inclined to think it's better to identify > foreign transactions by user mapping OID. The main reason is, I think > FDWs that manages connection caches by pair of user OID and server OID > potentially has a problem with the scenario Fujii-san mentioned. If an > FDW has to use another user mapping (i.g., connection information) due > to the currently used user mapping being removed, it would have to > disconnect the previous connection because it has to use the same > connection cache. 
But at that time it doesn't know the transaction > will be committed or aborted. > > Also, such FDW has the same problem that postgres_fdw used to have; a > backend establishes multiple connections with the same connection > information if multiple local users use the public user mapping. Even > from the perspective of foreign transaction management, it more makes > sense that foreign transactions correspond to the connections to > foreign servers, not to the local connection information. > > I can see that some FDW implementations such as mysql_fdw and > firebird_fdw identify connections by pair of server OID and user OID > but I think this is because they consulted to old postgres_fdw code. I > suspect that there is no use case where FDW needs to identify > connections in that way. If the core GTM identifies them by user > mapping OID, we could enforce those FDWs to change their way but I > think that change would be the right improvement. > > Regards, > > -- > Masahiko Sawada > EDB: https://www.enterprisedb.com/ > > > Regression is failing, can you please take a look. https://cirrus-ci.com/task/5522445932167168 t/080_pg_isready.pl ... ok # Failed test 'parallel reindexdb for system with --concurrently skips catalog
Re: pgbench - add pseudo-random permutation function
Hello Alvaro,

+ /*-
+ * Apply 4 rounds of bijective transformations using key updated
+ * at each stage:
+ *
+ * (1) whiten: partial xors on overlapping power-of-2 subsets
+ * for instance with v in 0 .. 14 (i.e. with size == 15):
+ * if v is in 0 .. 7 do v = (v ^ k) % 8
+ * if v is in 7 .. 14 do v = 14 - ((14-v) ^ k) % 8
+ * note that because of the overlap (here 7), v may be changed twice.
+ * this transformation is bijective because the condition to apply it
+ * is still true after applying it, and xor itself is bijective on a
+ * power-of-2 size.
+ *
+ * (2) scatter: linear modulo
+ * v = (v * p + k) % size
+ * this transformation is bijective if p & size are relatively prime,
+ * which is ensured in the code by the while loop which discards primes
+ * when size is a multiple of it.
+ */

My main question on this now is, do you have a scholar reference for this algorithm?

Nope, otherwise I would have put a reference. I'm a scholar though, if it helps :-)

I could not find any algorithm that fitted the bill. The usual approach (eg benchmarking designs) is to use some hash function and assume that it is not a permutation, too bad.

Basically the algorithm mimics a few rounds of cryptographic encryption adapted to any size and simple operators, whereas encryption functions are restricted to power-of-two blocks (eg the Feistel network). The structure is the same as AES: AddRoundKey is the xor-ing stage (adapted to non power of two in "whiten" above), MixColumns does the scattering, and for key expansion I used Donald Knuth's generator. Basically one could say that permute is an inexpensive and insecure AES :-)

We could add a reference to AES for the structure of the algorithm itself, but otherwise I just iterated designs till I was satisfied with the result (again: inexpensive and constant cost, any size, permutation…).

-- Fabien.
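For the curious, the two stages can be tried out in a small standalone sketch of the size == 15 example from the comment (illustrative only, not pgbench's code; the round keys and the multiplier p are arbitrary choices). Because whiten maps each overlapping half onto itself and 7 is relatively prime to 15, every round is a bijection, so the four-round composition is a permutation of 0..14:

```c
#include <assert.h>
#include <stdint.h>

/* One "whiten" round for size == 15: xor within two overlapping
 * power-of-2 subsets (0..7 and 7..14). Each branch is a bijection on
 * its subset and keeps the value inside that subset, so the whole
 * round is a bijection on 0..14. */
static uint64_t
whiten15(uint64_t v, uint64_t k)
{
	if (v <= 7)
		v = (v ^ k) % 8;
	if (v >= 7)
		v = 14 - (((14 - v) ^ k) % 8);
	return v;
}

/* One "scatter" round: a linear map modulo size, bijective when p and
 * size are relatively prime (here p = 7, size = 15). */
static uint64_t
scatter15(uint64_t v, uint64_t p, uint64_t k)
{
	return (v * p + k) % 15;
}

/* Four rounds of whiten + scatter with a per-round key, following the
 * structure described in the quoted comment. */
static uint64_t
permute15(uint64_t v, const uint64_t key[4])
{
	for (int r = 0; r < 4; r++)
	{
		v = whiten15(v, key[r]);
		v = scatter15(v, 7, key[r]);
	}
	return v;
}
```

Feeding all fifteen inputs through permute15() and checking that each output occurs exactly once demonstrates the bijectivity claim for any choice of keys.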
Re: More time spending with "delete pending"
Hello hackers, 22.11.2020 18:59, Alexander Lakhin wrote: > 19.11.2020 01:28, Tom Lane wrote: > >> Hmm, that is an interesting question isn't it. Here's a search going >> back a full year. There are a few in v12 --- interestingly, all on >> the statistics file, none from pg_ctl --- and none in v13. Of course, >> v13 has only existed as a separate branch since 2020-06-07. >> >> There's also a buildfarm test-coverage artifact involved. The bulk >> of the HEAD reports are from dory and walleye, neither of which are >> building v13 as yet, because of unclear problems [1]. I think those >> two animals build much more frequently than our other Windows animals, >> too, so the fact that they have more may be just because of that and >> not because they're somehow more susceptible. Because of that, I'm not >> sure that we have enough evidence to say that v13 is better than HEAD. >> If there is some new bug, it's post-v12, but maybe not post-v13. But >> v12 is evidently not perfect either. > The failures on HEAD (on dory and walleye) need another investigation > (maybe they are caused by modified stat()...). 
To revive this topic and confirm that the issue still persists, I've searched through the buildfarm logs (for HEAD and REL_13_STABLE) and collected the following failures of the misc_functions test with the "Permission denied" error:

bowerbird
HEAD (13 runs last week, 500 total, 18 failed) -
REL_13_STABLE (4 runs last week, 237 total, 8 failed) -

drongo
HEAD (25 runs last week, 500 total, 24 failed)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2020-12-27%2019%3A01%3A50
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2021-01-18%2001%3A01%3A55
REL_13_STABLE (5 runs last week, 255 total, 2 failed) -

hamerkop
HEAD (8 runs last week, 501 total, 20 failed)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamerkop&dt=2021-02-22%2010%3A20%3A00

fairywren
HEAD (27 runs last week, 500 total, 28 failed)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-02-22%2012%3A03%3A35
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-01-28%2000%3A04%3A50
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-01-19%2023%3A29%3A38
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-01-17%2018%3A08%3A11
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-11-30%2009%3A57%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-11-19%2006%3A02%3A22
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-11-18%2018%3A02%3A04
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-09-16%2019%3A48%3A21
REL_13_STABLE (5 runs last week, 302 total, 5 failed)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2021-02-26%2003%3A03%3A42
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-10-20%2007%3A50%3A44
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-09-10%2006%3A02%3A12
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-09-04%2019%3A58%3A39

woodlouse
HEAD (33 runs last week, 500 total, 5 failed) -
REL_13_STABLE (4 runs last week, 325 total, 1 failed)
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=woodlouse&dt=2021-01-20%2011%3A42%3A30

dory
HEAD (52 runs last week, 500 total, 25 failed) -

whelk
HEAD (31 runs last week, 500 total, 11 failed) -
REL_13_STABLE (7 runs last week, 224 total, 1 failed) -

(I've not considered lorikeet, jacana, and walleye as they are using alternative compilers.)

So on some machines there wasn't a single failure (probably it depends on hardware), but where it fails, such failures account for a significant share of all failures.

I believe that the patch attached to [1] should fix this issue. The patch still applies to master and makes the demotest (attached to [2]) pass. Also I've prepared a trivial patch that makes pgwin32_open() use the original stat() function (as in the proposed change for _pgstat64()).

[1] https://www.postgresql.org/message-id/979f4ee6-9960-e37f-f3ee-f458e8b0%40gmail.com
[2] https://www.postgresql.org/message-id/c3427edf-d7c0-ff57-90f6-b5de3bb62709%40gmail.com

Best regards,
Alexander

diff --git a/src/port/open.c b/src/port/open.c
index 14c6debba9..4bd11ca9d9 100644
--- a/src/port/open.c
+++ b/src/port/open.c
@@ -157,9 +157,9 @@ pgwin32_open(const char *fileName, int fileFlags,...)
 {
 if (loops < 10)
 {
-struct stat st;
+struct microsoft_native_stat st;
 
-if (stat(fileName, &st) != 0)
+if (microsoft_native_stat(fileName, &st) != 0)
 {
 pg_usleep(10);
 loops++;
Re: REINDEX backend filtering
On Sun, Mar 14, 2021 at 08:54:11PM +0900, Michael Paquier wrote: > On Sun, Mar 14, 2021 at 04:10:07PM +0800, Julien Rouhaud wrote: > > + booloutdated_filter = false; > Wouldn't it be better to rename that "outdated" instead for > consistency with the other options? I agree. > In ReindexRelationConcurrently(), there is no filtering done for the > index themselves. The operation is only done on the list of indexes > fetched from the parent relation. Why? This means that a REINDEX > (OUTDATED) INDEX would actually rebuild an index even if this index > has no out-of-date collations, like a catalog. I think that's > confusing. > > Same comment for the non-concurrent case, as of the code paths of > reindex_index(). Yes, I'm not sure what we should do in that case. I thought I put a comment about that but it apparently disappeared during some rewrite. Is there really a use case for reindexing a specific index and at the same time asking for possibly ignoring it? I think we should just forbid REINDEX (OUTDATED) INDEX. What do you think? > Would it be better to inform the user which indexes are getting > skipped in the verbose output if REINDEXOPT_VERBOSE is set? I was thinking that users would be more interested in the list of indexes being processed rather than the full list of indexes and a mention of whether it was processed or not. I can change that if you prefer. > + > +Check if the specified index has any outdated dependency. For now > only > +dependency on collations are supported. > + > [...] > +OUTDATED > + > + > + This option can be used to filter the list of indexes to rebuild and > only > + process indexes that have outdated dependencies. Fow now, the only > + handle dependency is for the collation provider version. > + > Do we really need to be this specific in this part of the > documentation with collations? I think it's important to document what this option really does, and I don't see a better place to document it. 
> The last sentence of this paragraph > sounds weird. Don't you mean instead to write "the only dependency > handled currently is the collation provider version"? Indeed, I'll fix! > +\set VERBOSITY terse \\ -- suppress machine-dependent details > +-- no suitable index should be found > +REINDEX (OUTDATED) TABLE reindex_coll; > What are those details? That's just the same comment as the previous occurrence in the file, I kept it for consistency. > And wouldn't it be more stable to just check > after the relfilenode of the indexes instead? Agreed, I'll add additional tests for that. > " ORDER BY sum(ci.relpages)" > Schema qualification here, twice. Well, this isn't actually mandatory, per comment at the top: /* * The queries here are using a safe search_path, so there's no need to * fully qualify everything. */ But I think it's a better style to fully qualify objects, so I'll fix. > + rel = try_relation_open(indexOid, AccessShareLock); > + > + if (rel == NULL) > + PG_RETURN_NULL(); > Let's introduce a try_index_open() here. Good idea! > What's the point in having both index_has_outdated_collation() and > index_has_outdated_collation()? Did you mean index_has_outdated_collation() and index_has_outdated_dependency()? It's just to keep things separated, mostly for future improvements on that infrastructure. I can get rid of that function and put back the code in index_has_outdated_dependency() if that's overkill. > It seems to me that 0001 should be split into two patches: > - One for the backend OUTDATED option. > - One for pg_index_has_outdated_dependency(), which only makes sense > in-core once reindexdb is introduced. I thought it would be better to add the backend part in a single commit, and then build the client part on top of that in a different commit. I can rearrange things if you want, but in that case should index_has_outdated_dependency() be in a different patch as you mention or simply merged with 0002 (the --outdated option for reindexdb)?
Re: pgbench - add pseudo-random permutation function
On 2021-Mar-14, Fabien COELHO wrote: > + /*- > + * Apply 4 rounds of bijective transformations using a key updated > + * at each stage: > + * > + * (1) whiten: partial xors on overlapping power-of-2 subsets > + * for instance with v in 0 .. 14 (i.e. with size == 15): > + * if v is in 0 .. 7 do v = (v ^ k) % 8 > + * if v is in 7 .. 14 do v = 14 - ((14-v) ^ k) % 8 > + * note that because of the overlap (here 7), v may be changed > twice. > + * this transformation is bijective because the condition to apply > it > + * is still true after applying it, and xor itself is bijective on a > + * power-of-2 size. > + * > + * (2) scatter: linear modulo > + * v = (v * p + k) % size > + * this transformation is bijective if p and size are coprime, which is > + * ensured in the code by the while loop, which discards primes that > + * divide size. > + * > + */ My main question on this now is: do you have a scholarly reference for this algorithm? -- Álvaro Herrera Valdivia, Chile "Someone said that it is at least an order of magnitude more work to do production software than a prototype. I think he is wrong by at least an order of magnitude." (Brian Kernighan)
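For readers puzzling over the quoted comment, the two stages are easy to model in a few lines of Python and to check empirically. This is a toy sketch only: the key schedule, multiplier choice, and constants below are invented for illustration, and the actual patch derives its keys and primes differently.

```python
import math

def whiten(v, k, size):
    """Stage (1): partial xors on two overlapping power-of-2 subsets."""
    m = 1
    while m * 2 <= size:
        m *= 2                      # largest power of 2 <= size (8 for size 15)
    if v < m:                       # low subset [0, m)
        v = (v ^ k) % m
    if v >= size - m:               # high subset [size-m, size), overlapping the low one
        v = (size - 1) - ((((size - 1) - v) ^ k) % m)
    return v                        # each subset maps onto itself, so this is bijective

def scatter(v, k, p, size):
    """Stage (2): linear modulo, bijective because gcd(p, size) == 1."""
    return (v * p + k) % size

def permute(v, size, seed):
    p = 1 + seed % size
    while math.gcd(p, size) != 1:   # discard multipliers sharing a factor with size
        p += 1
    k = seed
    for _ in range(4):              # 4 rounds, as in the quoted comment
        k = (k * 6364136223846793005 + 1442695040888963407) % 2**64  # toy key schedule
        v = whiten(v, k, size)
        v = scatter(v, k, p, size)
    return v
```

Running `sorted(permute(v, size, seed) for v in range(size))` gives back `range(size)` for any size and seed tried, which illustrates the bijectivity argument: each whiten step permutes its power-of-2 subset and fixes the rest, each scatter step is a modular affine bijection, and a composition of bijections is a bijection.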
Re: Using COPY FREEZE in pgbench
> Ok. ISTM that the option should be triggered as soon as it is > available, but you do as you wish. Can you elaborate on why you think that using COPY FREEZE before 14 is beneficial? Or do you want to standardize on COPY FREEZE? > I'm unsure how this could ever happen. The benefit of not caring is > fewer lines of code in pgbench, and a later call will catch the issue > anyway if libpq is corrupted. I have looked at the code of PQprotocolVersion. The only case in which it returns 0 is when there's no connection. Yes, you are right. Once the connection has been successfully established, there's no chance it fails. So I agree with you. >>> Question: should there be a word about copy using the freeze option? >>> I'd say yes, in the section describing "g". >> >> Can you elaborate on the "section describing "g""? I am not sure what >> you mean. > > The "g" item in the section describing initialization steps > (i.e. option -I). I'd suggest just replacing "COPY" with "COPY > FREEZE" in the sentence. Ok. That section needs to be modified. Best regards, -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
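The gating this exchange converges on (use FREEZE only where it pays off, and treat a 0 server version as "no connection" rather than "very old server") can be sketched as follows. This is a toy model, not pgbench's actual code; the cutoff constant is an assumption, since the thread is still debating exactly which release makes COPY FREEZE beneficial.

```python
# Assumption: 14 is the first version where COPY FREEZE pays off;
# the thread above debates whether 13 already qualifies.
FREEZE_MIN_VERSION = 140000

def copy_command(table, server_version_num):
    """Build the data-loading COPY, adding FREEZE where it helps.

    server_version_num is what libpq's PQserverVersion() reports
    (e.g. 140000); it is 0 only when there is no connection at all,
    so treating 0 as an error is safe.
    """
    if server_version_num == 0:
        raise RuntimeError("no connection to the server")
    if server_version_num >= FREEZE_MIN_VERSION:
        return f"copy {table} from stdin with (freeze on)"
    return f"copy {table} from stdin"
```

With this shape there is no need for a separate "very old version" branch: the plain COPY spelling works everywhere, and the FREEZE variant is only emitted when the connected server is known to support and benefit from it.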
Re: Fallback table AM for relkinds without storage
On Wed, Feb 24, 2021 at 11:51:36AM +0900, Michael Paquier wrote: > For the file name, using something like pseudo_handler.c or similar > would be fine, I guess. However, if we go down the path of one AM per > relkind for the slot callback, then why not just calling the AMs > foreign_table_am, view_am and part_table_am? This could be coupled > with sanity checks to make sure that AMs assigned to those relations > are the expected ones. I am still not quite sure what needs to be done here and this needs more thoughts, so this has been marked as returned with feedback for now. Instead of pushing forward with this patch, I'll just spend more cycles on stuff that has more chances to make it into 14. -- Michael
Re: REINDEX backend filtering
On Sun, Mar 14, 2021 at 04:10:07PM +0800, Julien Rouhaud wrote: > v6 attached, rebase only due to conflict with recent commit. I have read through the patch. + booloutdated_filter = false; Wouldn't it be better to rename that "outdated" instead for consistency with the other options? In ReindexRelationConcurrently(), there is no filtering done for the index themselves. The operation is only done on the list of indexes fetched from the parent relation. Why? This means that a REINDEX (OUTDATED) INDEX would actually rebuild an index even if this index has no out-of-date collations, like a catalog. I think that's confusing. Same comment for the non-concurrent case, as of the code paths of reindex_index(). Would it be better to inform the user which indexes are getting skipped in the verbose output if REINDEXOPT_VERBOSE is set? + +Check if the specified index has any outdated dependency. For now only +dependency on collations are supported. + [...] +OUTDATED + + + This option can be used to filter the list of indexes to rebuild and only + process indexes that have outdated dependencies. Fow now, the only + handle dependency is for the collation provider version. + Do we really need to be this specific in this part of the documentation with collations? The last sentence of this paragraph sounds weird. Don't you mean instead to write "the only dependency handled currently is the collation provider version"? +\set VERBOSITY terse \\ -- suppress machine-dependent details +-- no suitable index should be found +REINDEX (OUTDATED) TABLE reindex_coll; What are those details? And wouldn't it be more stable to just check after the relfilenode of the indexes instead? " ORDER BY sum(ci.relpages)" Schema qualification here, twice. + printf(_(" --outdated only process indexes having outdated depencies\n")); s/depencies/dependencies/. + rel = try_relation_open(indexOid, AccessShareLock); + + if (rel == NULL) + PG_RETURN_NULL(); Let's introduce a try_index_open() here. 
What's the point in having both index_has_outdated_collation() and index_has_outdated_collation()? It seems to me that 0001 should be split into two patches: - One for the backend OUTDATED option. - One for pg_index_has_outdated_dependency(), which only makes sense in-core once reindexdb is introduced. -- Michael
Re: pgbench - add pseudo-random permutation function
See attached v22. v23: - replaces remaining occurences of "pr_perm" in the documentation - removes duplicated includes -- Fabien.diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml index 299d93b241..9f49a6a78d 100644 --- a/doc/src/sgml/ref/pgbench.sgml +++ b/doc/src/sgml/ref/pgbench.sgml @@ -1057,7 +1057,7 @@ pgbench options d default_seed - seed used in hash functions by default + seed used in hash and pseudo-random permutation functions by default @@ -1842,6 +1842,22 @@ SELECT 4 AS four \; SELECT 5 AS five \aset + + +permute ( i, size [, seed ] ) +integer + + +Permuted value of i, in [0,size). +This is the new position of i in a pseudo-random rearrangement of +size first integers parameterized by seed. + + +permute(0, 4) +an integer between 0 and 3 + + + pi () @@ -1960,6 +1976,20 @@ SELECT 4 AS four \; SELECT 5 AS five \aset shape of the distribution. + + + When designing a benchmark which selects rows non-uniformly, be aware + that using pgbench non-uniform random functions + directly leads to unduly correlations: rows with close ids come out with + similar probability, skewing performance measures because they also reside + in close pages. + + + You must use permute to shuffle the selected ids, or + some other additional step with similar effect, to avoid these skewing impacts. + + + @@ -2054,24 +2084,54 @@ f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) / hash_fnv1a accept an input value and an optional seed parameter. In case the seed isn't provided the value of :default_seed is used, which is initialized randomly unless set by the command-line --D option. Hash functions can be used to scatter the -distribution of random functions such as random_zipfian or -random_exponential. For instance, the following pgbench -script simulates possible real world workload typical for social media and +-D option. + + + +Function permute implements a parametric pseudo-random permutation +on an arbitrary size. 
+It allows to shuffle output of non uniform random functions so that +values drawn more often are not trivially correlated. +It permutes integers in [0, size) using a seed by applying rounds of +simple invertible functions, similarly to an encryption function, +although beware that it is not at all cryptographically secure. +Compared to hash functions discussed above, the function +ensures that a perfect permutation is applied: there are neither collisions +nor holes in the output values. +Values outside the interval are interpreted modulo the size. +The function raises an error if size is not positive. +If no seed is provided, :default_seed is used. + +For instance, the following pgbench script +simulates possible real world workload typical for social media and blogging platforms where few accounts generate excessive load: -\set r random_zipfian(0, 1, 1.07) -\set k abs(hash(:r)) % 100 +\set size 100 +\set r random_zipfian(1, :size, 1.07) +\set k 1 + permute(:r, :size) In some cases several distinct distributions are needed which don't correlate with each other and this is when implicit seed parameter comes in handy: -\set k1 abs(hash(:r, :default_seed + 123)) % 100 -\set k2 abs(hash(:r, :default_seed + 321)) % 100 +\set k1 1 + permute(:r, :size, :default_seed + 123) +\set k2 1 + permute(:r, :size, :default_seed + 321) + +An similar behavior can also be approximated with +hash: + + +\set size 100 +\set r random_zipfian(1, 100 * :size, 1.07) +\set k 1 + abs(hash(:r)) % :size + + +As this hash-modulo version generates collisions, some +id would not be reachable and others would come more often +than the target distribution. 
diff --git a/src/bin/pgbench/exprparse.y b/src/bin/pgbench/exprparse.y index 4d529ea550..73514a0a47 100644 --- a/src/bin/pgbench/exprparse.y +++ b/src/bin/pgbench/exprparse.y @@ -19,6 +19,7 @@ #define PGBENCH_NARGS_VARIABLE (-1) #define PGBENCH_NARGS_CASE (-2) #define PGBENCH_NARGS_HASH (-3) +#define PGBENCH_NARGS_PERMUTE (-4) PgBenchExpr *expr_parse_result; @@ -370,6 +371,9 @@ static const struct { "hash_fnv1a", PGBENCH_NARGS_HASH, PGBENCH_HASH_FNV1A }, + { + "permute", PGBENCH_NARGS_PERMUTE, PGBENCH_PERMUTE + }, /* keep as last array element */ { NULL, 0, 0 @@ -482,6 +486,19 @@ make_func(yyscan_t yyscanner, int fnumber, PgBenchExprList *args) } break; + /* pseudo-random permutation function with optional seed argument */ + case PGBENCH_NARGS_PERMUTE: + if (len < 2 || len > 3) +expr_yyerror_more(yyscanner, "unexpected number of arguments"
Re: psql - add SHOW_ALL_RESULTS option
Hello Peter, This reply was two months ago, and nothing has happened, so I have marked the patch as RwF. Given the ongoing work on returning multiple result sets from stored procedures[0], I went to dust off this patch. Based on the feedback, I put back the titular SHOW_ALL_RESULTS option, but set the default to on. I fixed the test failure in 013_crash_restart.pl. I also trimmed back the test changes a bit so that the resulting test output changes are more visible. (We could make those stylistic changes separately in another patch.) I'll put this back into the commitfest for another look. Thanks a lot for the fixes and pushing it forward! My 0.02€: I tested this updated version and have no further comments on it. From my point of view it could be committed. I would not bother to separate the test style adjustments. -- Fabien.
Re: PITR promote bug: Checkpointer writes to older timeline
On Thu, Mar 04, 2021 at 05:10:36PM +0900, Kyotaro Horiguchi wrote: > read_local_xlog_page is *designed* to maintain ThisTimeLineID. > Currently it doesn't seem utilized but I think it's sufficiently > reasonable that the function maintains ThisTimeLineID. I don't quite follow this line of thought. ThisTimeLineID is designed to remain 0 while recovery is running in most processes (with the notable exception of a WAL sender in a cascading setup, physical or logical, of course), so why is there any business for read_local_xlog_page() to touch this field at all while in recovery to begin with? I equally find it confusing that XLogReadDetermineTimeline() relies on a specific value of ThisTimeLineID in its own logic, while it clearly states that all its callers have to read the current active TLI beforehand. So I think that the existing logic is pretty weak, and that resetting the field is an incorrect approach. It seems to me that we had better not change ThisTimeLineID actively while in recovery in this code path and just let others do the job, like RecoveryInProgress() once recovery finishes, or GetStandbyFlushRecPtr() for a WAL sender. And finally, we should store the current TLI used for replay in a separate variable that gets passed down to XLogReadDetermineTimeline() as an argument. While going through it, I have simplified the proposed TAP tests a bit (thanks for replacing the sleep() call, Soumyadeep; it would have made the test slower for nothing on fast machines, and it would cause failures on very slow machines). The attached fixes the original issue for me, keeping all the records in their correct timeline. And I have not been able to break cascading setups. If it happens that such cases actually break, we have holes in our existing test coverage that should be improved. I cannot see anything fancy missing on this side, though. Any thoughts?
-- Michael diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h index 9ac602b674..34d6286d75 100644 --- a/src/include/access/xlogutils.h +++ b/src/include/access/xlogutils.h @@ -56,7 +56,9 @@ extern void wal_segment_open(XLogReaderState *state, extern void wal_segment_close(XLogReaderState *state); extern void XLogReadDetermineTimeline(XLogReaderState *state, - XLogRecPtr wantPage, uint32 wantLength); + XLogRecPtr wantPage, + uint32 wantLength, + TimeLineID readtli); extern void WALReadRaiseError(WALReadError *errinfo); diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c index d17d660f46..9695653bc2 100644 --- a/src/backend/access/transam/xlogutils.c +++ b/src/backend/access/transam/xlogutils.c @@ -658,6 +658,13 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum, * wantLength to the amount of the page that will be read, up to * XLOG_BLCKSZ. If the amount to be read isn't known, pass XLOG_BLCKSZ. * + * readtli is the current timeline as found by the caller of this routine. + * When not in recovery, this should be ThisTimeLineID. As ThisTimeLineID + * remains set to 0 for most processes while in recovery, the caller ought + * to provide the timeline number given as a result of GetXLogReplayRecPtr() + * instead (for a WAL sender this would actually be ThisTimeLineID as the + * field gets updated in a cascading WAL sender). + * * We switch to an xlog segment from the new timeline eagerly when on a * historical timeline, as soon as we reach the start of the xlog segment * containing the timeline switch. The server copied the segment to the new @@ -679,12 +686,13 @@ XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum, * * The caller must also make sure it doesn't read past the current replay * position (using GetXLogReplayRecPtr) if executing in recovery, so it - * doesn't fail to notice that the current timeline became historical. 
The - * caller must also update ThisTimeLineID with the result of - * GetXLogReplayRecPtr and must check RecoveryInProgress(). + * doesn't fail to notice that the current timeline became historical. */ void -XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength) +XLogReadDetermineTimeline(XLogReaderState *state, + XLogRecPtr wantPage, + uint32 wantLength, + TimeLineID readtli) { const XLogRecPtr lastReadPage = (state->seg.ws_segno * state->segcxt.ws_segsize + state->segoff); @@ -712,12 +720,12 @@ XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wa * just carry on. (Seeking backwards requires a check to make sure the * older page isn't on a prior timeline). * - * ThisTimeLineID might've become historical since we last looked, but the + * readtli might've become historical since we last looked, but the * caller is required not to read past the flush limit it saw at the time * it looked up the timeline. There's nothing we can do about it if * StartupXLOG() rename
Re: Using COPY FREEZE in pgbench
Hello, >> After giving it some thought, ISTM that there could also be a performance >> improvement with copy freeze from older versions, so I'd suggest adding it >> after 9.3 where the option was added, i.e. 90300. > You misunderstand COPY FREEZE. Pre-13 COPY FREEZE does not contribute a > performance improvement. See the discussions for more details. > https://www.postgresql.org/message-id/camku%3d1w3osjj2fneelhhnrlxfzitdgp9fphee08nt2fqfmz...@mail.gmail.com > https://www.postgresql.org/message-id/flat/CABOikdN-ptGv0mZntrK2Q8OtfUuAjqaYMGmkdU1dCKFtUxVLrg%40mail.gmail.com Ok. ISTM that the option should be triggered as soon as it is available, but you do as you wish. >> Also, I do not think it is worth to fail on a 0 PQserverVersion; just keep >> the number and consider it a very old version? > If PQserverVersion fails, then following PQ calls are likely to fail too. > What's the benefit of continuing after PQserverVersion returns 0? I'm unsure how this could ever happen. The benefit of not caring is fewer lines of code in pgbench, and a later call will catch the issue anyway if libpq is corrupted. >> Question: should there be a word about copy using the freeze option? >> I'd say yes, in the section describing "g". > Can you elaborate on the "section describing "g""? I am not sure what you > mean. The "g" item in the section describing initialization steps (i.e. option -I). I'd suggest just replacing "COPY" with "COPY FREEZE" in the sentence. -- Fabien.
Re: REINDEX backend filtering
On Wed, Mar 03, 2021 at 01:56:59PM +0800, Julien Rouhaud wrote: > > Please find attached v5 which address all previous comments: > > - consistently use "outdated" > - use REINDEX (OUTDATED) grammar (with a new unreserved OUTDATED keyword) > - new --outdated option to reindexdb > - expose a new "pg_index_has_outdated_dependency(regclass)" SQL function > - use that function in reindexdb --outdated to sort tables by total > indexes-to-be-processed size v6 attached, rebase only due to conflict with recent commit. >From d2e05e6f64c88b0d5074b9963586bf6999276762 Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Thu, 3 Dec 2020 15:54:42 +0800 Subject: [PATCH v6 1/2] Add a new OUTDATED filtering facility for REINDEX command. OUTDATED is added a new unreserved keyword. When used, REINDEX will only process indexes that have an outdated dependency. For now, only dependency on collations are supported but we'll likely support other kind of dependency in the future. Also add a new pg_index_has_outdated_dependency(regclass) SQL function, so client code can filter such indexes if needed. This function will also be used in a following commit to teach reindexdb to use this new OUTDATED option and order the tables by the amount of work that will actually be done. Catversion (should be) bumped. 
Author: Julien Rouhaud Reviewed-by: Discussion: https://postgr.es/m/20201203093143.GA64934%40nol --- doc/src/sgml/func.sgml | 27 -- doc/src/sgml/ref/reindex.sgml | 12 +++ src/backend/catalog/index.c| 107 - src/backend/commands/indexcmds.c | 12 ++- src/backend/parser/gram.y | 4 +- src/backend/utils/cache/relcache.c | 40 src/bin/psql/tab-complete.c| 2 +- src/include/catalog/index.h| 3 + src/include/catalog/pg_proc.dat| 4 + src/include/parser/kwlist.h| 1 + src/include/utils/relcache.h | 1 + src/test/regress/expected/create_index.out | 10 ++ src/test/regress/sql/create_index.sql | 10 ++ 13 files changed, 221 insertions(+), 12 deletions(-) diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 9492a3c6b9..0eda6678ac 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -26448,12 +26448,13 @@ SELECT pg_size_pretty(sum(pg_relation_size(relid))) AS total_size shows the functions -available for index maintenance tasks. (Note that these maintenance -tasks are normally done automatically by autovacuum; use of these -functions is only required in special cases.) -These functions cannot be executed during recovery. -Use of these functions is restricted to superusers and the owner -of the given index. +available for index maintenance tasks. (Note that the maintenance +tasks performing actions on indexes are normally done automatically by +autovacuum; use of these functions is only required in special cases.) +The functions performing actions on indexes cannot be executed during +recovery. +Use of the functions performing actions on indexes is restricted to +superusers and the owner of the given index. @@ -26538,6 +26539,20 @@ SELECT pg_size_pretty(sum(pg_relation_size(relid))) AS total_size option. + + + + + pg_index_has_outdated_dependency + +pg_index_has_outdated_dependency ( index regclass ) +boolean + + +Check if the specified index has any outdated dependency. For now only +dependency on collations are supported. 
+ + diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index ff4dba8c36..aa66e6461f 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -26,6 +26,7 @@ REINDEX [ ( option [, ...] ) ] { IN where option can be one of: CONCURRENTLY [ boolean ] +OUTDATED [ boolean ] TABLESPACE new_tablespace VERBOSE [ boolean ] @@ -188,6 +189,17 @@ REINDEX [ ( option [, ...] ) ] { IN + +OUTDATED + + + This option can be used to filter the list of indexes to rebuild and only + process indexes that have outdated dependencies. Fow now, the only + handle dependency is for the collation provider version. + + + + TABLESPACE diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index 4ef61b5efd..571feac5db 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -100,6 +100,12 @@ typedef struct Oid pendingReindexedIndexes[FLEXIBLE_ARRAY_MEMBER]; } SerializedReindexState; +typedef struct +{ + Oid relid; /* targetr index oid */ + bool outdated; /* depends on at least on deprected collation? */ +} IndexHasOutdatedColl; + /* non-export function prototypes */ static bool relationHasPrimaryKey(
Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Recent conflict, thanks to cfbot. v18 attached. >From fa94eba58ee0ca098cfde0d17de72dc230ee471c Mon Sep 17 00:00:00 2001 From: Julien Rouhaud Date: Wed, 14 Oct 2020 02:11:37 +0800 Subject: [PATCH v18 1/3] Move pg_stat_statements query jumbling to core. A new compute_queryid GUC is also added, to control whether the queryid should be computed. It's now possible to disable core queryid computation and use pg_stat_statements with a different algorithm to compute the queryid by using third-party module. Author: Julien Rouhaud Reviewed-by: Discussion: https://postgr.es/m/CA+8PKvQnMfOE-c3YLRwxOsCYXQDyP8VXs6CDtMZp1V4=d4l...@mail.gmail.com --- .../pg_stat_statements/pg_stat_statements.c | 805 + .../pg_stat_statements.conf | 1 + doc/src/sgml/config.sgml | 18 + src/backend/parser/analyze.c | 14 +- src/backend/tcop/postgres.c | 6 +- src/backend/utils/misc/Makefile | 1 + src/backend/utils/misc/guc.c | 10 + src/backend/utils/misc/postgresql.conf.sample | 1 + src/backend/utils/misc/queryjumble.c | 834 ++ src/include/parser/analyze.h | 4 +- src/include/utils/guc.h | 1 + src/include/utils/queryjumble.h | 58 ++ 12 files changed, 969 insertions(+), 784 deletions(-) create mode 100644 src/backend/utils/misc/queryjumble.c create mode 100644 src/include/utils/queryjumble.h diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c index 62cccbfa44..99bc7184cb 100644 --- a/contrib/pg_stat_statements/pg_stat_statements.c +++ b/contrib/pg_stat_statements/pg_stat_statements.c @@ -8,24 +8,9 @@ * a shared hashtable. (We track only as many distinct queries as will fit * in the designated amount of shared memory.) * - * As of Postgres 9.2, this module normalizes query entries. Normalization - * is a process whereby similar queries, typically differing only in their - * constants (though the exact rules are somewhat more subtle than that) are - * recognized as equivalent, and are tracked as a single entry. 
This is - * particularly useful for non-prepared queries. - * - * Normalization is implemented by fingerprinting queries, selectively - * serializing those fields of each query tree's nodes that are judged to be - * essential to the query. This is referred to as a query jumble. This is - * distinct from a regular serialization in that various extraneous - * information is ignored as irrelevant or not essential to the query, such - * as the collations of Vars and, most notably, the values of constants. - * - * This jumble is acquired at the end of parse analysis of each query, and - * a 64-bit hash of it is stored into the query's Query.queryId field. - * The server then copies this value around, making it available in plan - * tree(s) generated from the query. The executor can then use this value - * to blame query costs on the proper queryId. + * As of Postgres 9.2, this module normalizes query entries. As of Postgres + * 14, the normalization is done by the core, if compute_queryid is enabled, or + * by third-party modules if enabled. 
* * To facilitate presenting entries to users, we create "representative" query * strings in which constants are replaced with parameter symbols ($n), to @@ -114,8 +99,6 @@ static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100; #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ #define IS_STICKY(c) ((c.calls[PGSS_PLAN] + c.calls[PGSS_EXEC]) == 0) -#define JUMBLE_SIZE1024 /* query serialization buffer size */ - /* * Extension version number, for supporting older extension versions' objects */ @@ -235,40 +218,6 @@ typedef struct pgssSharedState pgssGlobalStats stats; /* global statistics for pgss */ } pgssSharedState; -/* - * Struct for tracking locations/lengths of constants during normalization - */ -typedef struct pgssLocationLen -{ - int location; /* start offset in query text */ - int length; /* length in bytes, or -1 to ignore */ -} pgssLocationLen; - -/* - * Working state for computing a query jumble and producing a normalized - * query string - */ -typedef struct pgssJumbleState -{ - /* Jumble of current query tree */ - unsigned char *jumble; - - /* Number of bytes used in jumble[] */ - Size jumble_len; - - /* Array of locations of constants that should be removed */ - pgssLocationLen *clocations; - - /* Allocated length of clocations array */ - int clocations_buf_size; - - /* Current number of valid entries in clocations array */ - int clocations_count; - - /* highest Param id we've seen, in order to start normalization correctly */ - int highest_extern_param_id; -} pgssJumbleState; - /* Local variables */ /* Current nesting depth of ExecutorRun+ProcessUtility calls */ @@ -342,7 +291,8 @@ PG_FUNCTION_INFO_V1(pg_stat_sta