Re: Use incremental sort paths for window functions
On Tue, 15 Sep 2020 at 05:19, Tomas Vondra wrote:
>
> On Wed, Jul 08, 2020 at 04:57:21PM +1200, David Rowley wrote:
> > Over on [1] someone was asking about chained window paths making use
> > of already partially sorted input. (The thread is on -general, so I
> > guessed they're not using PG13.)
> >
> > However, on checking PG13 to see if incremental sort would help their
> > case, I saw it didn't. Looking at the code I saw that
> > create_window_paths() and create_one_window_path() don't make any use
> > of incremental sort paths.
> >
> > I quickly put together the attached. It's only about 15 mins of work,
> > but it seems worth looking at a bit more for some future commitfest.
> > Yeah, I'll need to add some tests as I see nothing failed by changing
> > this.
>
> Yeah, I'm sure there are a couple other places that might benefit from
> incremental sort but were not included in the PG13 commit. The patch
> seems correct - did it help in the reported thread? How much?

Looks like I didn't mention the idea on the thread. I must have felt it
was just too many steps away from being very useful to mention it in the
-general thread.

I suppose it'll help similarly to any use case for incremental sort: lots
in some cases and less so in others. It'll mostly depend on how big each
incremental sort is. e.g. ORDER BY a, b when there's only an index on (a)
will be pretty good if a is unique, since each sort will be over quite
fast. If there are a million rows for each value of a, then incremental
sort would be less favourable.

> I suppose this might benefit from an optimization similar to the GROUP
> BY reordering discussed in [1]. For example, with
>
>     max(a) over (partition by b,c)
>
> I think we could use an index on (c) and consider incremental sort by
> c,b, i.e. with the inverted pathkeys. But that's a completely
> independent topic, I believe.

I've only vaguely followed that. Sounds like interesting work, but I
agree that it's not related to this.

Thanks for having a look at this.

David
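The effect David describes can be sketched outside the planner. This is a minimal, hypothetical Python model of incremental sort (not PostgreSQL's implementation): the input already arrives ordered by a prefix of the desired sort keys, so only one prefix-group needs sorting at a time, and each sort is small when the prefix is selective.

```python
from itertools import groupby

def incremental_sort(rows, prefix_key, full_key):
    """Sort `rows` on full_key, assuming they already arrive sorted on
    prefix_key (e.g. via an index on the prefix columns)."""
    out = []
    for _, group in groupby(rows, key=prefix_key):
        # Each group shares the same prefix value, so sorting the group
        # on the full key list yields globally sorted output.
        out.extend(sorted(group, key=full_key))
    return out

# ORDER BY a, b with input already sorted on (a):
rows = [(1, 9), (1, 2), (2, 5), (2, 1), (3, 7)]
result = incremental_sort(rows,
                          prefix_key=lambda r: r[0],
                          full_key=lambda r: (r[0], r[1]))
# result is sorted on (a, b), but no sort ever saw more than one a-group
```

When each value of `a` covers few rows, every `sorted()` call above is tiny; with a million rows per value of `a`, each group sort approaches a full sort, which is the "less favourable" case mentioned above.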
Re: Use incremental sort paths for window functions
> On 15 Sep 2020, at 01:17, David Rowley wrote:
>
> On Tue, 15 Sep 2020 at 00:02, Daniel Gustafsson wrote:
>>
>>> On 8 Jul 2020, at 06:57, David Rowley wrote:
>>>
>>> Over on [1] someone was asking about chained window paths making use
>>> of already partially sorted input. (The thread is on -general, so I
>>> guessed they're not using PG13.)
>>
>> The [1] reference wasn't qualified, do you remember which thread it was?
>
> That was sloppy of me. It's
> https://www.postgresql.org/message-id/CADd42iFZWwYNsXjEM_3HWK3QnfiCrMNmpOkZqyBQCabnVxOPtw%40mail.gmail.com

Thanks!

>>> However, on checking PG13 to see if incremental sort would help their
>>> case, I saw it didn't. Looking at the code I saw that
>>> create_window_paths() and create_one_window_path() don't make any use
>>> of incremental sort paths.
>>
>> Commit 728202b63cdcd7f counteracts this optimization in part since it
>> orders the windows such that the longest common prefix is executed
>> first to allow subsequent windows to skip sorting entirely.
>
> This would have been clearer if I'd remembered to include the link to
> the thread. The thread talks about sorting requirements like c1, c3
> then c1, c4. So it can make use of the common prefix and do
> incremental sorts.
>
> It sounds like you're talking about cases like: wfunc() over (order by
> a), wfunc2() over (order by a,b), where we can just sort on a,b and
> have that order work for the first wfunc(). That's a good optimisation
> but it does not work for the above case.

Right, the combination of these two optimizations will however work well
together for quite a few cases. On that note, assume we have the below
scenario:

    wfunc .. (order by a), .. (order by a,b), .. (order by a,b,c)

Currently the windows will be ordered such that a,b,c is sorted first,
with a,b and a not having to sort. I wonder if there is a good heuristic
to find cases where sorting a, then a,b incrementally and finally a,b,c
incrementally is cheaper than one big sort of a,b,c? If a,b,c would spill
but the subsequent incremental sorts won't, then perhaps that could be
such a case? Not sure if it's worth the planner time, just thinking out
loud.

>> A few comments on the patch: there is no check for
>> enable_incremental_sort, and it lacks tests (as already mentioned) for
>> the resulting plan.
>
> Yeah, it should be making sure enable_incremental_sort is on for sure.
> I've attached another version with a few tests added too.

No comments on this version, LGTM.

cheers ./daniel
Re: New statistics for tuning WAL buffer size
On 2020/09/15 15:52, Masahiro Ikeda wrote:
On 2020-09-11 01:40, Fujii Masao wrote:
On 2020/09/09 13:57, Masahiro Ikeda wrote:
On 2020-09-07 16:19, Fujii Masao wrote:
On 2020/09/07 9:58, Masahiro Ikeda wrote:

Thanks for the review and advice!

On 2020-09-03 16:05, Fujii Masao wrote:
On 2020/09/02 18:56, Masahiro Ikeda wrote:

    +/* --
    + * Backend types
    + * --

You seem to have forgotten to add "*/" to the above comment. This issue
could cause the following compiler warning:

    ../../src/include/pgstat.h:761:1: warning: '/*' within block comment [-Wcomment]

Thanks for the comment. I fixed it.

Thanks for the fix! But why are those comments necessary?

Sorry about that. This comment is not necessary. I removed it.

The pg_stat_walwriter view is not security restricted now, so ordinary
users can access it. It has the same security level as pg_stat_archiver.
If you have any comments, please let me know.

    + dirty_writes bigint

I guess that the column name "dirty_writes" derived from the DTrace probe
name. Isn't this name confusing? Should we rename it to
"wal_buffers_full" or something?

I agree, and renamed it to "wal_buffers_full".

    +/* --
    + * PgStat_MsgWalWriter    Sent by the walwriter to update statistics.

This comment seems not accurate because backends also send it.

    +/*
    + * WAL writes statistics counter is updated in XLogWrite function
    + */
    +extern PgStat_MsgWalWriter WalWriterStats;

This comment seems not right because the counter is not updated in
XLogWrite().

Right. I fixed it to "Sent by each backend and background workers to
update WAL statistics." In the future, other statistics will be included,
so I removed the function's name.

    +-- There will surely and maximum one record
    +select count(*) = 1 as ok from pg_stat_walwriter;

What about changing this comment to "There must be only one record"?

Thanks, I fixed it.

    + WalWriterStats.m_xlog_dirty_writes++;
    LWLockRelease(WALWriteLock);

Since WalWriterStats.m_xlog_dirty_writes doesn't need to be protected
with WALWriteLock, isn't it better to increment it after releasing the
lock?

Thanks, I fixed it.

    +CREATE VIEW pg_stat_walwriter AS
    +    SELECT
    +        pg_stat_get_xlog_dirty_writes() AS dirty_writes,
    +        pg_stat_get_walwriter_stat_reset_time() AS stats_reset;
    + CREATE VIEW pg_stat_progress_vacuum AS

In system_views.sql, the definition of pg_stat_walwriter should be placed
just after that of pg_stat_bgwriter, not pg_stat_progress_analyze.

OK, I fixed it.

    }
    -
    /*
     * We found an existing collector stats file. Read it and put all the

You seem to have accidentally removed the empty line here.

Sorry about that. I fixed it.

    - errhint("Target must be \"archiver\" or \"bgwriter\".")));
    + errhint("Target must be \"archiver\" or \"bgwriter\" or \"walwriter\".")));

There are two "or" in the message; should the former be replaced with ","?

Thanks, I fixed it.

On 2020-09-05 18:40, Magnus Hagander wrote:
On Fri, Sep 4, 2020 at 5:42 AM Fujii Masao wrote:
On 2020/09/04 11:50, tsunakawa.ta...@fujitsu.com wrote:
From: Fujii Masao

I changed the view name from pg_stat_walwrites to pg_stat_walwriter. I
think it is better to match the naming scheme of other views like
pg_stat_bgwriter, which is for bgwriter statistics but also has
statistics related to backends. I prefer the view name pg_stat_walwriter
for consistency with other view names.

But we also have pg_stat_wal_receiver. Which makes me think that maybe
pg_stat_wal_writer is better for consistency. Thoughts?

IMO either of them works for me. I'd like to hear more opinions about
this.

I think pg_stat_bgwriter is now a misnomer, because it contains the
backends' activity. Likewise, pg_stat_walwriter leads to
misunderstanding because its information is not limited to the WAL
writer. How about simply pg_stat_wal? In the future, we may want to
include WAL reads in this view, e.g. reading undo logs in zheap.

Sounds reasonable. +1.

pg_stat_bgwriter has had the "wrong name" for quite some time now -- it
became even more apparent when the checkpointer was split out to its own
process, and that's not exactly a recent change. And it had allocs in it
from day one...

I think naming it for what the data in it is ("wal") rather than which
process deals with it ("walwriter") is correct as a general rule, unless
the statistics can be known to only *ever* affect one type of process.
(And then different processes can affect different columns in the view.)
And from what I can tell, that's exactly what's being proposed.

Thanks for your comments. I agree with your opinions. I changed the view
name to "pg_stat_wal". I fixed the code to send the WAL statistics not
only from backends and the walwriter, but also from the checkpointer,
walsender and autovacuum workers.

Good point! Thanks for updating the patch!

    @@ -604,6 +604,7 @@ heap_vacuum_rel(Relation onerel,
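The review point about incrementing m_xlog_dirty_writes after LWLockRelease() is a general pattern: a counter that isn't shared data protected by the lock shouldn't be updated inside the critical section. A toy Python illustration (with a `threading.Lock` standing in for WALWriteLock; names here are made up, not PostgreSQL code):

```python
import threading

wal_write_lock = threading.Lock()       # stands in for WALWriteLock
local_stats = {"wal_buffers_full": 0}   # backend-local counter

def write_when_buffers_full():
    with wal_write_lock:
        pass  # ... write out WAL buffers under the lock ...
    # Increment after releasing the lock: the counter is not protected
    # by wal_write_lock, so updating it inside the critical section only
    # lengthens the hold time and adds contention for other writers.
    local_stats["wal_buffers_full"] += 1

for _ in range(3):
    write_when_buffers_full()
```

The counter is later shipped to the stats collector in bulk, so there is no correctness requirement that it be updated under the WAL lock at all.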
Re: Parallelize stream replication process
On 2020/09/15 13:41, Bharath Rupireddy wrote:
> On Tue, Sep 15, 2020 at 9:27 AM Li Japin wrote:
>> For now, postgres uses a single process to send, receive and replay
>> the WAL when we use streaming replication. Is there any point in
>> parallelizing this process? If so, how do we start? Any thoughts?

Probably this is another kind of parallelism than what you're thinking
of, but I was thinking of starting up a walwriter process in the standby
server and making it fsync the streamed WAL data. This means that we
leave a part of the tasks of the walreceiver process to the walwriter:
the walreceiver performs WAL receive and write, and the walwriter
performs WAL flush, in parallel.

I'm just expecting that this change would improve replication
performance, e.g., reduce the time to wait for synchronous replication.

Without this change (i.e., originally), only the walreceiver performs WAL
receive, write and flush. So while the walreceiver is fsyncing WAL data,
it cannot receive newly-arrived WAL data. If a WAL flush takes time, the
time to wait for synchronous replication in the primary is enlarged.

Regards,

--
Fujii Masao
Advanced Computing Technology Center, Research and Development Headquarters
NTT DATA CORPORATION
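The receive/flush split proposed above is classic pipeline parallelism. A sketch of the idea in Python (an assumed design for illustration, not actual PostgreSQL code): one thread receives and writes WAL chunks while a second thread flushes them, so receiving never blocks behind an fsync.

```python
import queue
import threading

def run_pipeline(chunks):
    """Model: 'walreceiver' receives/writes chunks, 'walwriter' flushes
    them concurrently via a hand-off queue."""
    to_flush = queue.Queue()
    flushed = []

    def walreceiver():
        for chunk in chunks:
            # receive + write the chunk, then hand it off for flushing
            to_flush.put(chunk)
        to_flush.put(None)  # end-of-stream marker

    def walwriter():
        while True:
            chunk = to_flush.get()
            if chunk is None:
                break
            # the fsync would happen here, overlapping further receives
            flushed.append(chunk)

    r = threading.Thread(target=walreceiver)
    w = threading.Thread(target=walwriter)
    r.start(); w.start()
    r.join(); w.join()
    return flushed

out = run_pipeline([b"rec1", b"rec2", b"rec3"])
```

Because the queue preserves arrival order, flush order matches receive order, which is what the WAL requires; the win is only that receive and flush now overlap in time.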
Re: pg_restore causing deadlocks on partitioned tables
On Tue, Sep 15, 2020 at 9:09 AM Tom Lane wrote:
> I wrote:
>> (2) ALTER TABLE ONLY ... ADD CONSTRAINT on a partition root tries to
>> get AccessExclusiveLock on all child partitions, despite the ONLY.
>
>> The cause of this seems to be that ATPrepSetNotNull is too dumb to
>> avoid recursing to all the child tables when the parent is already
>> attnotnull. Or is there a reason we have to recurse anyway?
>
> I wrote a quick patch for this part. It seems pretty safe and probably
> could be back-patched without fear.

The patch changes existing behavior in one case, as shown below:

    drop table if exists foo cascade;
    create table foo (a int not null);
    create table child () inherits (foo);
    alter table child alter a drop not null;
    alter table foo alter a set not null;
    insert into child select null;

Currently, the last statement gives a "not null violated" error, but no
longer with the patch, because the ALTER statement before it now finishes
without setting NOT NULL in the child.

The patch's theory that if the parent column has NOT NULL set then it
must be set in child tables too does not actually hold for plain
inheritance cases, because, as shown above, NOT NULL can be dropped in
children independently of the parent. With partitioning, dropping NOT
NULL from child partitions is prevented:

    drop table if exists bar cascade;
    create table bar (a int) partition by list (a);
    create table bar1 partition of bar (a not null) for values in (1);
    alter table bar1 alter a drop not null;
    ERROR:  column "a" is marked NOT NULL in parent table

> (I also noticed that ATSimpleRecursion is being unnecessarily stupid:
> instead of the demonstrably not-future-proof relkind check, it could
> test relhassubclass, which is not only simpler and less likely to need
> future changes, but is able to save a scan of pg_inherits in a lot
> more cases.)

+1

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING
On Tue, Sep 15, 2020 at 11:14 AM Andres Freund wrote:
> On 2020-09-15 10:54:29 +0530, Dilip Kumar wrote:
>> What problem do you see if we set xmax to InvalidTransactionId and the
>> HEAP_XMAX_INVALID flag in the infomask?
>
> 1) It'll make a dead tuple appear live. You cannot do this for tuples
>    with an xid below the horizon.

How is that possible? A tuple which has a committed xmax older than
oldestXmin should not come up for freezing unless the xid is lock-only,
because such tuples are already gone. So if the xmax is smaller than the
cutoff xid, then either it is lock-only or it is aborted. If the xmax is
lock-only, then I don't see any issue; OTOH, if it is an aborted xid that
is already smaller than the cutoff xid, then the tuple is live anyway.

> 2) it'll break HOT chain following / indexes.

If my theory in point 1 is correct, then I don't see this issue either.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: logtape.c stats don't account for unused "prefetched" block numbers
On Mon, Sep 14, 2020 at 11:52 PM Jeff Davis wrote:
> On Thu, 2020-09-10 at 18:42 -0700, Peter Geoghegan wrote:
>> We still need to put the reliance on ltsWriteBlock() allocating many
>> blocks before they've been logically written on some kind of formal
>> footing for Postgres 13 -- it is now possible that an all-zero block
>> will be left behind even after we're done writing and have flushed
>> all temp buffers, which is a new thing.
>
> Is the current direction of this thread (i.e. the two posted patches)
> addressing your concern here?

Yes.

--
Peter Geoghegan
Re: logtape.c stats don't account for unused "prefetched" block numbers
On Mon, 2020-09-14 at 20:48 -0700, Peter Geoghegan wrote:
> On Mon, Sep 14, 2020 at 8:07 PM Jeff Davis wrote:
>> Sure. Will backporting either patch into REL_13_STABLE now interfere
>> with RC1 release in any way?
>
> The RMT will discuss this.
>
> It would help if there was a patch ready to go.

Attached. HashAgg seems to accurately reflect the file size, as does Sort.

Regards,
Jeff Davis

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 28802e6588d..75e5bbf209d 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2704,8 +2704,8 @@ agg_refill_hash_table(AggState *aggstate)
 
 	if (spill_initialized)
 	{
-		hash_agg_update_metrics(aggstate, true, spill.npartitions);
 		hashagg_spill_finish(aggstate, &spill, batch->setno);
+		hash_agg_update_metrics(aggstate, true, spill.npartitions);
 	}
 	else
 		hash_agg_update_metrics(aggstate, true, 0);

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index e904ce55185..71c8468b494 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -1262,9 +1262,19 @@ LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 
 /*
  * Obtain total disk space currently used by a LogicalTapeSet, in blocks.
+ *
+ * This should not be called while actively writing to tapes, otherwise it
+ * may not account for buffered data.
  */
 long
 LogicalTapeSetBlocks(LogicalTapeSet *lts)
 {
-	return lts->nBlocksAllocated - lts->nHoleBlocks;
+#ifdef USE_ASSERT_CHECKING
+	for (int i = 0; i < lts->nTapes; i++)
+	{
+		LogicalTape *lt = &lts->tapes[i];
+
+		Assert(!lt->writing || lt->buffer == NULL);
+	}
+#endif
+	return lts->nBlocksWritten - lts->nHoleBlocks;
 }
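Why the patch switches the stat from allocated to written blocks can be shown with a toy Python model (assumed names, not the real logtape.c structures): block numbers may be handed out ("prefetched") before anything is written to them, so counting allocations overstates actual disk usage.

```python
class TapeSetModel:
    """Toy accounting model for a logical tape set."""
    def __init__(self):
        self.n_allocated = 0  # block numbers handed out so far
        self.n_written = 0    # blocks actually written to disk
        self.n_hole = 0       # hole blocks excluded from usage

    def preallocate(self, n):
        # hand out n block numbers ahead of any writes
        self.n_allocated += n

    def write(self, n):
        self.n_written += n

    def blocks_old(self):
        # pre-patch stat: counts preallocated-but-unused block numbers
        return self.n_allocated - self.n_hole

    def blocks_new(self):
        # post-patch stat: counts only blocks actually written
        return self.n_written - self.n_hole

lts = TapeSetModel()
lts.preallocate(128)  # "prefetched" block numbers
lts.write(100)        # only 100 of them are ever written
# blocks_old() reports 128 and overstates usage;
# blocks_new() reports 100, matching the file contents
```

This mirrors the one-line change in the diff: `nBlocksAllocated - nHoleBlocks` becomes `nBlocksWritten - nHoleBlocks`, so the reported disk usage tracks what was actually written rather than what was merely reserved.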