Re: Use incremental sort paths for window functions

2020-09-15 Thread David Rowley
On Tue, 15 Sep 2020 at 05:19, Tomas Vondra  wrote:
>
> On Wed, Jul 08, 2020 at 04:57:21PM +1200, David Rowley wrote:
> >Over on [1] someone was asking about chained window paths making use
> >of already partially sorted input.  (The thread is on -general, so I
> >guessed they're not using PG13.)
> >
> >However, on checking PG13 to see if incremental sort would help their
> >case, I saw it didn't. Looking at the code I saw that
> >create_window_paths() and create_one_window_path() don't make any use
> >of incremental sort paths.
> >
> >I quickly put together the attached. It's only about 15 mins of work,
> >but it seems worth looking at a bit more for some future commitfest.
> >Yeah, I'll need to add some tests as I see nothing failed by changing
> >this.
> >
>
> Yeah, I'm sure there are a couple other places that might benefit from
> incremental sort but were not included in the PG13 commit. The patch
> seems correct - did it help in the reported thread? How much?

Looks like I didn't mention the idea on the thread.  I must have felt
it was just too many steps away from being very useful to mention it
in the -general thread.

I suppose it'll help much as in any other use case for incremental sort:
a lot in some cases and less so in others. It'll mostly depend on how big
each incremental sort group is.  e.g. ORDER BY a, b when there's only an
index on (a) will be pretty good if a is unique, since each sort will be
over quite quickly. If there are a million rows for each value of a, then
incremental sort would be less favourable.
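
To make that concrete, here's a rough sketch (made-up table and data, not
taken from the patch or its tests) of the plan shape this enables:

create table t (a int, b int);
create index on t (a);
insert into t select i, i % 10 from generate_series(1, 100000) i;
analyze t;

explain (costs off)
select a, b, sum(b) over (order by a, b) from t;

-- with the patch, the hope is something like:
--   WindowAgg
--     ->  Incremental Sort
--           Sort Key: a, b
--           Presorted Key: a
--           ->  Index Scan using t_a_idx on t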

> I suppose this might benefit from an optimization similar to the GROUP
> BY reordering discussed in [1]. For example, with
>
> max(a) over (partition by b,c)
>
> I think we could use an index on (c) and consider incremental sort by c,b,
> i.e. with the inverted pathkeys. But that's a completely independent
> topic, I believe.

I've only vaguely followed that. Sounds like interesting work, but I
agree that it's not related to this.

Thanks for having a look at this.

David




Re: Use incremental sort paths for window functions

2020-09-15 Thread Daniel Gustafsson
> On 15 Sep 2020, at 01:17, David Rowley  wrote:
> 
> On Tue, 15 Sep 2020 at 00:02, Daniel Gustafsson  wrote:
>> 
>>> On 8 Jul 2020, at 06:57, David Rowley  wrote:
>>> 
>>> Over on [1] someone was asking about chained window paths making use
>>> of already partially sorted input.  (The thread is on -general, so I
>>> guessed they're not using PG13.)
>> 
>> The [1] reference wasn't qualified, do you remember which thread it was?
> 
> That was sloppy of me.  It's
> https://www.postgresql.org/message-id/CADd42iFZWwYNsXjEM_3HWK3QnfiCrMNmpOkZqyBQCabnVxOPtw%40mail.gmail.com

Thanks!

>>> However, on checking PG13 to see if incremental sort would help their
>>> case, I saw it didn't. Looking at the code I saw that
>>> create_window_paths() and create_one_window_path() don't make any use
>>> of incremental sort paths.
>> 
>> Commit 728202b63cdcd7f counteracts this optimization in part since it orders
>> the windows such that the longest common prefix is executed first to allow
>> subsequent windows to skip sorting entirely.
> 
> This would have been clearer if I'd remembered to include the link to
> the thread.  The thread talks about sorting requirements like c1, c3
> then c1, c4. So it can make use of the common prefix and do
> incremental sorts.
> 
> It sounds like you're talking about cases like: wfunc() over (order by
> a), wfunc2() over (order by a,b), where we can just sort on a,b and
> have that order work for the first wfunc(). That's a good optimisation
> but it does not work for the above case.

Right, the combination of these two optimizations will, however, work well
together in quite a few cases.

On that note, assume we have the below scenario:

wfunc .. (order by a), .. (order by a,b), .. (order by a,b,c)

Currently the windows will be ordered such that a,b,c is sorted first, with a,b
and a not having to sort.  I wonder if there is a good heuristic to find cases
where sorting a, then a,b incrementally and finally a,b,c incrementally is
cheaper than a big sort of a,b,c?  If a,b,c would spill but subsequent
incremental sorts won't then perhaps that could be a case?  Not sure if it's
worth the planner time, just thinking out loud.
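
To put that scenario in SQL (purely illustrative, assuming some table
t(a, b, c, x), nothing from the actual patch):

explain (costs off)
select sum(x) over (order by a),
       sum(x) over (order by a, b),
       sum(x) over (order by a, b, c)
from t;

Today that sorts once on (a, b, c) and the (a, b) and (a) windows reuse the
order.  The open question is whether Sort (a) -> Incremental Sort (a, b) ->
Incremental Sort (a, b, c) could ever win, perhaps when the full (a, b, c)
sort would spill to disk while the per-group incremental sorts stay within
work_mem.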

>> A few comments on the patch: there is no check for enable_incremental_sort,
>> and it lacks tests (as already mentioned) for the resulting plan.
> 
> Yeah, it should be making sure enable_incremental_sort is on for sure.
> I've attached another version with a few tests added too.

No comments on this version, LGTM.

cheers ./daniel



Re: New statistics for tuning WAL buffer size

2020-09-15 Thread Fujii Masao




On 2020/09/15 15:52, Masahiro Ikeda wrote:

On 2020-09-11 01:40, Fujii Masao wrote:

On 2020/09/09 13:57, Masahiro Ikeda wrote:

On 2020-09-07 16:19, Fujii Masao wrote:

On 2020/09/07 9:58, Masahiro Ikeda wrote:

Thanks for the review and advice!

On 2020-09-03 16:05, Fujii Masao wrote:

On 2020/09/02 18:56, Masahiro Ikeda wrote:

+/* --
+ * Backend types
+ * --

You seem to forget to add "*/" into the above comment.
This issue could cause the following compiler warning.
../../src/include/pgstat.h:761:1: warning: '/*' within block comment [-Wcomment]


Thanks for the comment. I fixed.


Thanks for the fix! But why are those comments necessary?


Sorry about that. This comment is not necessary.
I removed it.


The pg_stat_walwriter is not security restricted now, so ordinary users can 
access it.
It has the same security level as pg_stat_archiver. If you have any comments, 
please let me know.


+   dirty_writes bigint

I guess that the column name "dirty_writes" was derived from
the DTrace probe name. Isn't this name confusing? Should we
rename it to "wal_buffers_full" or something?


I agree and renamed it to "wal_buffers_full".


+/* --
+ * PgStat_MsgWalWriter    Sent by the walwriter to update statistics.

This comment seems not accurate because backends also send it.

+/*
+ * WAL writes statistics counter is updated in XLogWrite function
+ */
+extern PgStat_MsgWalWriter WalWriterStats;

This comment seems not right because the counter is not updated in XLogWrite().


Right. I fixed it to "Sent by each backend and background workers to update WAL
statistics."
In the future, other statistics will be included, so I removed the function's
name.



+-- There will surely and maximum one record
+select count(*) = 1 as ok from pg_stat_walwriter;

What about changing this comment to "There must be only one record"?


Thanks, I fixed.


+    WalWriterStats.m_xlog_dirty_writes++;
 LWLockRelease(WALWriteLock);

Since WalWriterStats.m_xlog_dirty_writes doesn't need to be protected
with WALWriteLock, isn't it better to increment that after releasing the lock?


Thanks, I fixed.


+CREATE VIEW pg_stat_walwriter AS
+    SELECT
+    pg_stat_get_xlog_dirty_writes() AS dirty_writes,
+    pg_stat_get_walwriter_stat_reset_time() AS stats_reset;
+
 CREATE VIEW pg_stat_progress_vacuum AS

In system_views.sql, the definition of pg_stat_walwriter should be
placed just after that of pg_stat_bgwriter not pg_stat_progress_analyze.


OK, I fixed it.


 }
-
 /*
  * We found an existing collector stats file. Read it and put all the

You seem to accidentally have removed the empty line here.


Sorry about that. I fixed it.


- errhint("Target must be \"archiver\" or \"bgwriter\".")));
+ errhint("Target must be \"archiver\" or \"bgwriter\" or
\"walwriter\".")));

There are two "or"s in the message; shouldn't the former be replaced with ","?


Thanks, I fixed.


On 2020-09-05 18:40, Magnus Hagander wrote:

On Fri, Sep 4, 2020 at 5:42 AM Fujii Masao wrote:


On 2020/09/04 11:50, tsunakawa.ta...@fujitsu.com wrote:

From: Fujii Masao 

I changed the view name from pg_stat_walwrites to pg_stat_walwriter.
I think it is better to match the naming scheme with other views like
pg_stat_bgwriter, which is for bgwriter statistics but has statistics
related to backends.


I prefer the view name pg_stat_walwriter for the consistency with
other view names. But we also have pg_stat_wal_receiver, which
makes me think that maybe pg_stat_wal_writer is better for
consistency. Thoughts? IMO either of them works for me.
I'd like to hear more opinions about this.


I think pg_stat_bgwriter is now a misnomer, because it contains the
backends' activity.  Likewise, pg_stat_walwriter leads to misunderstanding
because its information is not limited to the WAL writer.


How about simply pg_stat_wal?  In the future, we may want to include WAL
reads in this view, e.g. reading undo logs in zheap.

Sounds reasonable.


+1.

pg_stat_bgwriter has had the "wrong name" for quite some time now --
it became even more apparent when the checkpointer was split out to
its own process, and that's not exactly a recent change. And it had
allocs in it from day one...

I think naming it for what the data in it is ("wal") rather than which
process deals with it ("walwriter") is correct, unless the statistics
can be known to only *ever* affect one type of process. (And then
different processes can affect different columns in the view). As a
general rule -- and that's from what I can tell exactly what's being
proposed.


Thanks for your comments. I agree with your opinions.
I changed the view name to "pg_stat_wal".
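
With the renames agreed above, the view would end up looking roughly like
this (a sketch only -- the function names here are placeholders, not
necessarily what the patch uses):

CREATE VIEW pg_stat_wal AS
    SELECT
        pg_stat_get_wal_buffers_full() AS wal_buffers_full,
        pg_stat_get_wal_stat_reset_time() AS stats_reset;

The intended tuning use is straightforward: a steadily growing
wal_buffers_full after a stats reset suggests wal_buffers is too small for
the write workload.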


I fixed the code to send the WAL statistics not only from backends and the
walwriter but also from the checkpointer, walsender and autovacuum workers.


Good point! Thanks for updating the patch!


@@ -604,6 +604,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,

Re: Parallelize stream replication process

2020-09-15 Thread Fujii Masao




On 2020/09/15 13:41, Bharath Rupireddy wrote:

On Tue, Sep 15, 2020 at 9:27 AM Li Japin  wrote:


For now, postgres uses a single process to send, receive and replay the WAL
when we use streaming replication. Is there any point in parallelizing this
process? If so, how do we start?

Any thoughts?


Probably this is a different kind of parallelism from what you're thinking of,
but I was thinking of starting up the walwriter process in the standby server
and making it fsync the streamed WAL data. This means we offload part of the
walreceiver process's work to the walwriter: the walreceiver performs WAL
receive and write, and the walwriter performs WAL flush, in parallel. I'm just
expecting that this change would improve replication performance, e.g., reduce
the time to wait for sync replication.

Without this change (i.e., originally), only the walreceiver performs WAL
receive, write and flush. So while the walreceiver is fsyncing WAL data,
it cannot receive newly-arrived WAL data. If the WAL flush takes a long time,
the time to wait for sync replication on the primary grows accordingly.
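
That wait is visible on the primary as the gap between the standby's write
and flush progress in pg_stat_replication, e.g.:

select application_name, write_lsn, flush_lsn, write_lag, flush_lag
from pg_stat_replication;

Narrowing flush_lag relative to write_lag is essentially what offloading
the fsync to a separate process would aim for.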

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: pg_restore causing deadlocks on partitioned tables

2020-09-15 Thread Amit Langote
On Tue, Sep 15, 2020 at 9:09 AM Tom Lane  wrote:
> I wrote:
> >> (2) ALTER TABLE ONLY ... ADD CONSTRAINT on a partition root tries to get
> >> AccessExclusiveLock on all child partitions, despite the ONLY.
>
> > The cause of this seems to be that ATPrepSetNotNull is too dumb to
> > avoid recursing to all the child tables when the parent is already
> > attnotnull.  Or is there a reason we have to recurse anyway?
>
> I wrote a quick patch for this part.  It seems pretty safe and probably
> could be back-patched without fear.

The patch changes existing behavior in one case as shown below:

drop table if exists foo cascade;
create table foo (a int not null);
create table child () inherits (foo);
alter table child alter a drop not null;
alter table foo alter a set not null;
insert into child select null;

Currently, the last statement gives a "not null violated" error, but
no longer with the patch, because the alter statement before that now
finishes without setting NOT NULL in child.

The patch's theory that if the parent column has NOT NULL set then it
must be set in child tables too does not actually hold for plain
inheritance cases, because as shown above, NOT NULL can be dropped in
children independently of the parent.  With partitioning, dropping NOT
NULL from child partitions is prevented:

drop table if exists bar cascade;
create table bar (a int) partition by list (a);
create table bar1 partition of bar (a not null) for values in (1);
alter table bar1 alter a drop not null;
ERROR:  column "a" is marked NOT NULL in parent table

>  (I also noticed that
> ATSimpleRecursion is being unnecessarily stupid: instead of the
> demonstrably not-future-proof relkind check, it could test relhassubclass,
> which is not only simpler and less likely to need future changes, but
> is able to save a scan of pg_inherits in a lot more cases.)

+1

--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com




Re: Allow ERROR from heap_prepare_freeze_tuple to be downgraded to WARNING

2020-09-15 Thread Dilip Kumar
On Tue, Sep 15, 2020 at 11:14 AM Andres Freund  wrote:
>
> On 2020-09-15 10:54:29 +0530, Dilip Kumar wrote:
> > What problem do you see if we set xmax to the InvalidTransactionId and
> > HEAP_XMAX_INVALID flag in the infomask ?
>
> 1) It'll make a dead tuple appear live. You cannot do this for tuples
>with an xid below the horizon.

How is that possible?  A tuple with a committed xmax older than oldestXmin
should not come up for freezing, unless the xmax is a lock-only xid (because
such tuples are already gone).  So if the xmax is smaller than the cutoff
xid, then it is either lock-only or aborted.  If the xmax is lock-only then
I don't see any issue; OTOH, if it is an aborted xid that is already smaller
than the cutoff xid, then the tuple is live anyway.

>2) it'll break HOT chain following / indexes.

If my theory in point 1 above is correct, then I don't see this issue either.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-15 Thread Peter Geoghegan
On Mon, Sep 14, 2020 at 11:52 PM Jeff Davis  wrote:
> On Thu, 2020-09-10 at 18:42 -0700, Peter Geoghegan wrote:
> > We still need to put the reliance on ltsWriteBlock() allocating many
> > blocks before they've been logically written on some kind of formal
> > footing for Postgres 13 -- it is now possible that an all-zero block
> > will be left behind even after we're done writing and have flushed
> > all
> > temp buffers, which is a new thing.
>
> Is the current direction of this thread (i.e. the two posted patches)
> addressing your concern here?

Yes.

-- 
Peter Geoghegan




Re: New statistics for tuning WAL buffer size

2020-09-15 Thread Masahiro Ikeda

On 2020-09-11 01:40, Fujii Masao wrote:

On 2020/09/09 13:57, Masahiro Ikeda wrote:

On 2020-09-07 16:19, Fujii Masao wrote:

On 2020/09/07 9:58, Masahiro Ikeda wrote:

Thanks for the review and advice!

On 2020-09-03 16:05, Fujii Masao wrote:

On 2020/09/02 18:56, Masahiro Ikeda wrote:

+/* --
+ * Backend types
+ * --

You seem to forget to add "*/" into the above comment.
This issue could cause the following compiler warning.
../../src/include/pgstat.h:761:1: warning: '/*' within block 
comment [-Wcomment]


Thanks for the comment. I fixed.


Thanks for the fix! But why are those comments necessary?


Sorry about that. This comment is not necessary.
I removed it.

The pg_stat_walwriter is not security restricted now, so ordinary 
users can access it.
It has the same security level as pg_stat_archiver. If you have 
any comments, please let me know.


+   dirty_writes bigint

I guess that the column name "dirty_writes" was derived from
the DTrace probe name. Isn't this name confusing? Should we
rename it to "wal_buffers_full" or something?


I agree and renamed it to "wal_buffers_full".


+/* --
+ * PgStat_MsgWalWriter    Sent by the walwriter to update 
statistics.


This comment seems not accurate because backends also send it.

+/*
+ * WAL writes statistics counter is updated in XLogWrite function
+ */
+extern PgStat_MsgWalWriter WalWriterStats;

This comment seems not right because the counter is not updated in 
XLogWrite().


Right. I fixed it to "Sent by each backend and background workers to
update WAL statistics."
In the future, other statistics will be included, so I removed the
function's name.




+-- There will surely and maximum one record
+select count(*) = 1 as ok from pg_stat_walwriter;

What about changing this comment to "There must be only one 
record"?


Thanks, I fixed.


+    WalWriterStats.m_xlog_dirty_writes++;
 LWLockRelease(WALWriteLock);

Since WalWriterStats.m_xlog_dirty_writes doesn't need to be 
protected
with WALWriteLock, isn't it better to increment that after 
releasing the lock?


Thanks, I fixed.


+CREATE VIEW pg_stat_walwriter AS
+    SELECT
+    pg_stat_get_xlog_dirty_writes() AS dirty_writes,
+    pg_stat_get_walwriter_stat_reset_time() AS stats_reset;
+
 CREATE VIEW pg_stat_progress_vacuum AS

In system_views.sql, the definition of pg_stat_walwriter should be
placed just after that of pg_stat_bgwriter not 
pg_stat_progress_analyze.


OK, I fixed it.


 }
-
 /*
  * We found an existing collector stats file. Read it and put 
all the


You seem to accidentally have removed the empty line here.


Sorry about that. I fixed it.

- errhint("Target must be \"archiver\" or \"bgwriter\".")));
+ errhint("Target must be \"archiver\" or \"bgwriter\" or
\"walwriter\".")));

There are two "or"s in the message; shouldn't the former be replaced with ","?


Thanks, I fixed.


On 2020-09-05 18:40, Magnus Hagander wrote:

On Fri, Sep 4, 2020 at 5:42 AM Fujii Masao wrote:


On 2020/09/04 11:50, tsunakawa.ta...@fujitsu.com wrote:

From: Fujii Masao 

I changed the view name from pg_stat_walwrites to pg_stat_walwriter.
I think it is better to match the naming scheme with other views like
pg_stat_bgwriter, which is for bgwriter statistics but has statistics
related to backends.


I prefer the view name pg_stat_walwriter for the consistency with
other view names. But we also have pg_stat_wal_receiver, which
makes me think that maybe pg_stat_wal_writer is better for
consistency. Thoughts? IMO either of them works for me.
I'd like to hear more opinions about this.


I think pg_stat_bgwriter is now a misnomer, because it contains the
backends' activity.  Likewise, pg_stat_walwriter leads to misunderstanding
because its information is not limited to the WAL writer.


How about simply pg_stat_wal?  In the future, we may want to include WAL
reads in this view, e.g. reading undo logs in zheap.

Sounds reasonable.


+1.

pg_stat_bgwriter has had the "wrong name" for quite some time now --
it became even more apparent when the checkpointer was split out to
its own process, and that's not exactly a recent change. And it had
allocs in it from day one...

I think naming it for what the data in it is ("wal") rather than which
process deals with it ("walwriter") is correct, unless the statistics
can be known to only *ever* affect one type of process. (And then
different processes can affect different columns in the view). As a
general rule -- and that's from what I can tell exactly what's being
proposed.


Thanks for your comments. I agree with your opinions.
I changed the view name to "pg_stat_wal".


I fixed the code to send the WAL statistics not only from backends and the
walwriter but also from the checkpointer, walsender and autovacuum workers.


Good point! Thanks for updating the patch!


@@ -604,6 +604,7 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,

Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-15 Thread Jeff Davis
On Thu, 2020-09-10 at 18:42 -0700, Peter Geoghegan wrote:
> We still need to put the reliance on ltsWriteBlock() allocating many
> blocks before they've been logically written on some kind of formal
> footing for Postgres 13 -- it is now possible that an all-zero block
> will be left behind even after we're done writing and have flushed
> all
> temp buffers, which is a new thing.

Is the current direction of this thread (i.e. the two posted patches)
addressing your concern here?

Regards,
Jeff Davis






Re: logtape.c stats don't account for unused "prefetched" block numbers

2020-09-15 Thread Jeff Davis
On Mon, 2020-09-14 at 20:48 -0700, Peter Geoghegan wrote:
> On Mon, Sep 14, 2020 at 8:07 PM Jeff Davis  wrote:
> > Sure. Will backporting either patch into REL_13_STABLE now
> > interfere
> > with RC1 release in any way?
> 
> The RMT will discuss this.
> 
> It would help if there was a patch ready to go.

Attached. HashAgg seems to accurately reflect the filesize, as does
Sort.

Regards,
Jeff Davis

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 28802e6588d..75e5bbf209d 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2704,8 +2704,8 @@ agg_refill_hash_table(AggState *aggstate)
 
 	if (spill_initialized)
 	{
-		hash_agg_update_metrics(aggstate, true, spill.npartitions);
 		hashagg_spill_finish(aggstate, &spill, batch->setno);
+		hash_agg_update_metrics(aggstate, true, spill.npartitions);
 	}
 	else
 		hash_agg_update_metrics(aggstate, true, 0);
diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index e904ce55185..71c8468b494 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -1262,9 +1262,19 @@ LogicalTapeTell(LogicalTapeSet *lts, int tapenum,
 
 /*
  * Obtain total disk space currently used by a LogicalTapeSet, in blocks.
+ *
+ * This should not be called while actively writing to tapes, otherwise it may
+ * not account for buffered data.
  */
 long
 LogicalTapeSetBlocks(LogicalTapeSet *lts)
 {
-	return lts->nBlocksAllocated - lts->nHoleBlocks;
+#ifdef USE_ASSERT_CHECKING
+	for (int i = 0; i < lts->nTapes; i++)
+	{
+		LogicalTape *lt = &lts->tapes[i];
+		Assert(!lt->writing || lt->buffer == NULL);
+	}
+#endif
+	return lts->nBlocksWritten - lts->nHoleBlocks;
 }
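
For context on where this number surfaces: LogicalTapeSetBlocks() feeds the
disk-usage figures that EXPLAIN ANALYZE reports for spilling sorts and hash
aggregates, so the over-count was directly user-visible.  A quick (made-up
settings) way to see the figure:

set work_mem = '64kB';
explain (analyze, costs off, timing off)
select g from generate_series(1, 100000) g order by g;
-- look for:  Sort Method: external merge  Disk: ...kB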

