automatic analyze: readahead - add "IO read time" log message

2020-10-26 Thread Jakub Wartak
Greetings hackers,

I have, I hope, an interesting observation (and a nano patch proposal) from a 
system where statistics freshness is a critical factor. Autovacuum/statistics 
auto-gathering was tuned to be very aggressive:
autovacuum_vacuum_cost_delay=0 (makes autovacuum_vacuum_cost_limit irrelevant)
autovacuum_naptime=1s
autovacuum_max_workers=4

Some critical table partitions are configured with:  
autovacuum_analyze_scale_factor=0.001, autovacuum_analyze_threshold=5 to 
force auto-analyze jobs pretty often. The interesting logs are like this:
automatic analyze of table "t1" system usage: CPU: user: 37.52 s, system: 23.01 
s, elapsed: 252.14 s
automatic analyze of table "t2" system usage: CPU: user: 38.70 s, system: 22.63 
s, elapsed: 317.33 s
automatic analyze of table "t2" system usage: CPU: user: 39.38 s, system: 21.43 
s, elapsed: 213.58 s
automatic analyze of table "t1" system usage: CPU: user: 37.91 s, system: 24.49 
s, elapsed: 229.45 s

and this is the root cause of my question.  As you can see, there is a huge 3x-4x 
discrepancy between "elapsed" and user+system time, which is strange, at least to 
me, as there should be no waiting (autovacuum_vacuum_cost_delay=0). According to 
various tools this is true: the time was spent somewhere else, but not in the 
PostgreSQL analyze itself. ps(1) and pidstat(1) also report the same for the 
worker:

06:56:12 AM     PID    %usr  %system  %guest    %CPU  CPU  Command
06:56:13 AM  114774    8.00    10.00    0.00   18.00   18  postgres
06:56:14 AM  114774    8.00    11.00    0.00   19.00   15  postgres
06:56:15 AM  114774    5.00    13.00    0.00   18.00   18  postgres

06:56:17 AM     PID   kB_rd/s   kB_wr/s  kB_ccwr/s  Command
06:56:18 AM  114774  63746.53      0.00       0.00  postgres
06:56:19 AM  114774  62896.00      0.00       0.00  postgres
06:56:20 AM  114774  62920.00      0.00       0.00  postgres

One could argue that such an autoanalyze worker could perform 5x more work 
(%CPU -> 100%) here. The I/O system is not doing much (total = 242MB/s 
reads @ 22k IOPS, 7MB/s writes @ 7k IOPS, with a service time of 0.04ms); 
although it reports high utilization, I'm pretty sure it could push much more. 
There can be up to 3x-4x such 70-80MB/s analyzes running in parallel (say, 
300MB/s for statistics collection alone).

According to various gdb backtraces, a lot of time is spent here:
#0  0x7f98cdfc9f73 in __pread_nocancel () from /lib64/libpthread.so.0
#1  0x00741a16 in pread (__offset=811253760, __nbytes=8192,
    __buf=0x7f9413ab7280, __fd=<optimized out>) at /usr/include/bits/unistd.h:84
#2  FileRead (file=<optimized out>, buffer=0x7f9413ab7280 "\037\005",
    amount=8192, offset=811253760, wait_event_info=167772173) at fd.c:1883
#3  0x00764b8f in mdread (reln=<optimized out>, forknum=<optimized out>,
    blocknum=19890902, buffer=0x7f9413ab7280 "\037\005") at md.c:596
#4  0x0073d69b in ReadBuffer_common (smgr=<optimized out>,
    relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19890902,
    mode=RBM_NORMAL, strategy=0x1102278, hit=0x7fffba7e2d4f) at bufmgr.c:897
#5  0x0073e27e in ReadBufferExtended (reln=0x7f98c0c9ded0,
    forkNum=MAIN_FORKNUM, blockNum=19890902, mode=<optimized out>,
    strategy=<optimized out>) at bufmgr.c:665
#6  0x004c2e2f in heapam_scan_analyze_next_block (scan=<optimized out>,
    blockno=<optimized out>, bstrategy=<optimized out>) at heapam_handler.c:998
#7  0x00597de1 in table_scan_analyze_next_block (bstrategy=<optimized out>,
    blockno=<optimized out>, scan=0x10c8098)
    at ../../../src/include/access/tableam.h:1462
#8  acquire_sample_rows (onerel=0x7f98c0c9ded0, elevel=13, rows=0x1342e08,
    targrows=150, totalrows=0x7fffba7e3160, totaldeadrows=0x7fffba7e3158)
    at analyze.c:1048
#9  0x00596a50 in do_analyze_rel (onerel=0x7f98c0c9ded0,
    params=0x10744e4, va_cols=0x0, acquirefunc=0x597ca0 <acquire_sample_rows>,
    relpages=26763525, inh=false, in_outer_xact=false, elevel=13) at analyze.c:502
[..]
#12 0x006e76b4 in autovacuum_do_vac_analyze (bstrategy=0x1102278,
    tab=<optimized out>) at autovacuum.c:3118
[..]

The interesting thing is that targrows=1.5 million and that the blocks are 
accessed (as expected) in sorted order:

Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890910, 
bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890912, 
bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890922, 
bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890935, 
bstrategy=0x1102278) at heapam_handler.c:984
Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, blockno=19890996, 
bstrategy=0x1102278) at heapam_handler.c:984

And this probably explains the discrepancy: perf with CPU usage reporting shows 
a lot of frames waiting on I/O in readahead done by ext4 (trimmed for clarity):

# Children      Self       sys       usr  Command   Shared Object      Symbol
    63.64%     0.00%     0.00%     0.00%  postgres  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
--

Re: automatic analyze: readahead - add "IO read time" log message

2021-03-07 Thread Tomas Vondra
Hi,

On 2/10/21 11:10 PM, Stephen Frost wrote:
> Greetings,
> 
> * Heikki Linnakangas (hlinn...@iki.fi) wrote:
>> On 05/02/2021 23:22, Stephen Frost wrote:
>>> Unless there's anything else on this, I'll commit these sometime next
>>> week.
>>
>> One more thing: Instead of using 'effective_io_concurrency' GUC directly,
>> should call get_tablespace_maintenance_io_concurrency().
> 
> Ah, yeah, of course.
> 
> Updated patch attached.
> 

A couple minor comments:

1) I think the patch should be split into two parts, one adding the
track_io_timing, one adding the prefetching.

2) I haven't tried but I'm pretty sure there'll be a compiler warning
about 'prefetch_maximum' being unused without USE_PREFETCH defined.

3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows?

This makes the code rather hard to read, IMHO. It seems to me we can
move the code around a bit and merge some of the #ifdef blocks - see the
attached patch. Most of this is fairly trivial, with the exception of
moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this
does not materially change the behavior, but perhaps I'm wrong.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 967de73596..492ed78e1c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7457,9 +7457,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         I/O timing information is
         displayed in <link linkend="monitoring-pg-stat-database-view">
         <structname>pg_stat_database</structname></link>, in the output of
-        <xref linkend="sql-explain"/> when the <literal>BUFFERS</literal> option is
-        used, and by <xref linkend="pgstatstatements"/>.  Only superusers can
-        change this setting.
+        <xref linkend="sql-explain"/> when the <literal>BUFFERS</literal> option
+        is used, by autovacuum for auto-vacuums and auto-analyzes, when
+        <xref linkend="guc-log-autovacuum-min-duration"/> is set and by
+        <xref linkend="pgstatstatements"/>.  Only superusers can change this
+        setting.

   
  
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d8f847b0e6..ab40be0771 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -440,6 +440,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	TransactionId new_frozen_xid;
 	MultiXactId new_min_multi;
 	ErrorContextCallback errcallback;
+	PgStat_Counter startreadtime = 0;
+	PgStat_Counter startwritetime = 0;
 
 	Assert(params != NULL);
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
@@ -454,6 +456,11 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	{
 		pg_rusage_init(&ru0);
 		starttime = GetCurrentTimestamp();
+		if (track_io_timing)
+		{
+			startreadtime = pgStatBlockReadTime;
+			startwritetime = pgStatBlockWriteTime;
+		}
 	}
 
 	if (params->options & VACOPT_VERBOSE)
@@ -674,6 +681,17 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 			 (long long) VacuumPageDirty);
 			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 			 read_rate, write_rate);
+			if (track_io_timing)
+			{
+				appendStringInfoString(&buf, _("I/O Timings:"));
+				if (pgStatBlockReadTime - startreadtime > 0)
+					appendStringInfo(&buf, _(" read=%.3f"),
+									 (double) (pgStatBlockReadTime - startreadtime) / 1000);
+				if (pgStatBlockWriteTime - startwritetime > 0)
+					appendStringInfo(&buf, _(" write=%.3f"),
+									 (double) (pgStatBlockWriteTime - startwritetime) / 1000);
+				appendStringInfoChar(&buf, '\n');
+			}
 			appendStringInfo(&buf, _("system usage: %s\n"), pg_rusage_show(&ru0));
 			appendStringInfo(&buf,
 			 _("WAL usage: %ld records, %ld full page images, %llu bytes"),
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7295cf0215..fc32598639 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -63,6 +63,7 @@
 #include "utils/pg_rusage.h"
 #include "utils/sampling.h"
 #include "utils/sortsupport.h"
+#include "utils/spccache.h"
 #include "utils/syscache.h"
 #include "utils/timestamp.h"
 
@@ -312,6 +313,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	int64		AnalyzePageHit = VacuumPageHit;
+	int64		AnalyzePageMiss = VacuumPageMiss;
+	int64		AnalyzePageDirty = VacuumPageDirty;
+	PgStat_Counter startreadtime = 0;
+	PgStat_Counter startwritetime = 0;
 
 	if (inh)
 		ereport(elevel,
@@ -346,8 +352,14 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	/* measure elapsed time iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
+		if (track_io_timing)
+		{
+			startreadtime = pgStatBlockReadTime;
+			startwritetime = pgStatBlockWriteTime;
+		}
+
 		pg_rusage_init(&ru0);
-		if (params->log_min_duration > 0)
+		if (params->log_min_duration >= 0)
 			starttime = GetCurrentTimestamp();
 	}
 
@@ -682,15 +694,90 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	/* Log the action if appropriate */
 

Re: automatic analyze: readahead - add "IO read time" log message

2021-03-08 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> On 2/10/21 11:10 PM, Stephen Frost wrote:
> > * Heikki Linnakangas (hlinn...@iki.fi) wrote:
> >> On 05/02/2021 23:22, Stephen Frost wrote:
> >>> Unless there's anything else on this, I'll commit these sometime next
> >>> week.
> >>
> >> One more thing: Instead of using 'effective_io_concurrency' GUC directly,
> >> should call get_tablespace_maintenance_io_concurrency().
> > 
> > Ah, yeah, of course.
> > 
> > Updated patch attached.
> 
> A couple minor comments:
> 
> 1) I think the patch should be split into two parts, one adding the
> track_io_timing, one adding the prefetching.

This was already done..

> 2) I haven't tried but I'm pretty sure there'll be a compiler warning
> about 'prefetch_maximum' being unused without USE_PREFETCH defined.

Ah, that part is likely true; moved it down into the #ifdef block to
address that, which is also good since it avoids mistakenly using it
outside the #ifdefs later on.

> 3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows?

Perhaps..

> This makes the code rather hard to read, IMHO. It seems to me we can
> move the code around a bit and merge some of the #ifdef blocks - see the
> attached patch. Most of this is fairly trivial, with the exception of
> moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this
> does not materially change the behavior, but perhaps I'm wrong.

but I don't particularly like doing the prefetch right before we run
vacuum_delay_point() and potentially sleep.

Rebased and updated patch attached.

Thanks!

Stephen
From ee8e7d0bfa28b8b41b8d9995f64f881d9ce351bd Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Fri, 5 Feb 2021 15:59:02 -0500
Subject: [PATCH 1/2] Improve logging of auto-vacuum and auto-analyze

When logging auto-vacuum and auto-analyze activity, include the I/O
timing if track_io_timing is enabled.  Also, for auto-analyze, add the
read rate and the dirty rate, similar to how that information has
historically been logged for auto-vacuum.

Stephen Frost and Jakub Wartak

Reviewed-By: Heikki Linnakangas
Discussion: https://www.postgresql.org/message-id/VI1PR0701MB69603A433348EDCF783C6ECBF6EF0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
---
 doc/src/sgml/config.sgml |   8 ++-
 src/backend/access/heap/vacuumlazy.c |  18 +
 src/backend/commands/analyze.c   | 100 +--
 3 files changed, 116 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 967de73596..492ed78e1c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7457,9 +7457,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         I/O timing information is
         displayed in <link linkend="monitoring-pg-stat-database-view">
         <structname>pg_stat_database</structname></link>, in the output of
-        <xref linkend="sql-explain"/> when the <literal>BUFFERS</literal> option is
-        used, and by <xref linkend="pgstatstatements"/>.  Only superusers can
-        change this setting.
+        <xref linkend="sql-explain"/> when the <literal>BUFFERS</literal> option
+        is used, by autovacuum for auto-vacuums and auto-analyzes, when
+        <xref linkend="guc-log-autovacuum-min-duration"/> is set and by
+        <xref linkend="pgstatstatements"/>.  Only superusers can change this
+        setting.

   
  
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d8f847b0e6..ab40be0771 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -440,6 +440,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	TransactionId new_frozen_xid;
 	MultiXactId new_min_multi;
 	ErrorContextCallback errcallback;
+	PgStat_Counter startreadtime = 0;
+	PgStat_Counter startwritetime = 0;
 
 	Assert(params != NULL);
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
@@ -454,6 +456,11 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	{
 		pg_rusage_init(&ru0);
 		starttime = GetCurrentTimestamp();
+		if (track_io_timing)
+		{
+			startreadtime = pgStatBlockReadTime;
+			startwritetime = pgStatBlockWriteTime;
+		}
 	}
 
 	if (params->options & VACOPT_VERBOSE)
@@ -674,6 +681,17 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 			 (long long) VacuumPageDirty);
 			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 			 read_rate, write_rate);
+			if (track_io_timing)
+			{
+				appendStringInfoString(&buf, _("I/O Timings:"));
+				if (pgStatBlockReadTime - startreadtime > 0)
+					appendStringInfo(&buf, _(" read=%.3f"),
+									 (double) (pgStatBlockReadTime - startreadtime) / 1000);
+				if (pgStatBlockWriteTime - startwritetime > 0)
+					appendStringInfo(&buf, _(" write=%.3f"),
+									 (double) (pgStatBlockWriteTime - startwritetime) / 1000);
+				appendStringInfoChar(&buf, '\n');
+			}
 			appendStringInfo(&buf, _("system usage: %s\n"), pg_rusage_show(&ru0));
 			appendStringInfo(&buf,
 			 _("WAL usage: %ld records, %ld full page images, %llu bytes"),
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7295cf0215..0864

Re: automatic analyze: readahead - add "IO read time" log message

2021-03-10 Thread Tomas Vondra
On 3/8/21 8:42 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
>> On 2/10/21 11:10 PM, Stephen Frost wrote:
>>> * Heikki Linnakangas (hlinn...@iki.fi) wrote:
 On 05/02/2021 23:22, Stephen Frost wrote:
> Unless there's anything else on this, I'll commit these sometime next
> week.

 One more thing: Instead of using 'effective_io_concurrency' GUC directly,
 should call get_tablespace_maintenance_io_concurrency().
>>>
>>> Ah, yeah, of course.
>>>
>>> Updated patch attached.
>>
>> A couple minor comments:
>>
>> 1) I think the patch should be split into two parts, one adding the
>> track_io_timing, one adding the prefetching.
> 
> This was already done..
> 

Not sure what you mean by "done"? I see the patch still does both
changes related to track_io_timing and prefetching.

>> 2) I haven't tried but I'm pretty sure there'll be a compiler warning
>> about 'prefetch_maximum' being unused without USE_PREFETCH defined.
> 
> Ah, that part is likely true; moved it down into the #ifdef block to
> address that, which is also good since it avoids mistakenly using it
> outside the #ifdefs later on.
> 

OK

>> 3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows?
> 
> Perhaps..
> 
>> This makes the code rather hard to read, IMHO. It seems to me we can
>> move the code around a bit and merge some of the #ifdef blocks - see the
>> attached patch. Most of this is fairly trivial, with the exception of
>> moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this
>> does not materially change the behavior, but perhaps I'm wrong.
> 
> but I don't particularly like doing the prefetch right before we run
> vacuum_delay_point() and potentially sleep.
> 

Why? Is that just a matter of personal preference (fair enough) or is
there a reason why that would be wrong/harmful?

I think e.g. prefetch_targblock could be moved to the next #ifdef, which
will eliminate the one-line ifdef.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: automatic analyze: readahead - add "IO read time" log message

2021-03-11 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> On 3/8/21 8:42 PM, Stephen Frost wrote:
> > * Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> >> On 2/10/21 11:10 PM, Stephen Frost wrote:
> >>> * Heikki Linnakangas (hlinn...@iki.fi) wrote:
>  On 05/02/2021 23:22, Stephen Frost wrote:
> > Unless there's anything else on this, I'll commit these sometime next
> > week.
> 
>  One more thing: Instead of using 'effective_io_concurrency' GUC directly,
>  should call get_tablespace_maintenance_io_concurrency().
> >>>
> >>> Ah, yeah, of course.
> >>>
> >>> Updated patch attached.
> >>
> >> A couple minor comments:
> >>
> >> 1) I think the patch should be split into two parts, one adding the
> >> track_io_timing, one adding the prefetching.
> > 
> > This was already done..
> 
> Not sure what you mean by "done"? I see the patch still does both
> changes related to track_io_timing and prefetching.

They are two patches..

➜  ~ grep Subject analyze_prefetch_v8.patch 
Subject: [PATCH 1/2] Improve logging of auto-vacuum and auto-analyze
Subject: [PATCH 2/2] Use pre-fetching for ANALYZE

The first doesn't have any prefetch-related things, the second doesn't
have any track_io_timing things..

> >> 3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows?
> > 
> > Perhaps..
> > 
> >> This makes the code rather hard to read, IMHO. It seems to me we can
> >> move the code around a bit and merge some of the #ifdef blocks - see the
> >> attached patch. Most of this is fairly trivial, with the exception of
> >> moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this
> >> does not materially change the behavior, but perhaps I'm wrong.
> > 
> > but I don't particularly like doing the prefetch right before we run
> > vacuum_delay_point() and potentially sleep.
> 
> Why? Is that just a matter of personal preference (fair enough) or is
> there a reason why that would be wrong/harmful?

Telling the kernel "hey, we're about to need this, please go get it" and
then immediately going to sleep just strikes me as a bit off.  We should
be trying to minimize the time between prefetch and actual request for
the page.  Of course, there's going to be some times that we will issue
a prefetch and then come around again and end up hitting the limit and
sleeping before we actually request the page and maybe it doesn't
ultimately matter much but it just seems better to sleep first before
issuing the prefetch to minimize the amount 'outstanding' when we do end
up sleeping.  One thing about prefetching is that the kernel is
certainly within its rights to decide to drop the page before we
actually go to read it, if it's under pressure and we're just sleeping.

> I think e.g. prefetch_targblock could be moved to the next #ifdef, which
> will eliminate the one-line ifdef.

Sure, done in the attached.

Thanks for the review!  Unless there's other comments, I'll plan to push
this over the weekend or early next week.

Stephen
From b5cd546d32f1f384755761dbb34a215318a5e251 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Fri, 5 Feb 2021 15:59:02 -0500
Subject: [PATCH 1/2] Improve logging of auto-vacuum and auto-analyze

When logging auto-vacuum and auto-analyze activity, include the I/O
timing if track_io_timing is enabled.  Also, for auto-analyze, add the
read rate and the dirty rate, similar to how that information has
historically been logged for auto-vacuum.

Stephen Frost and Jakub Wartak

Reviewed-By: Heikki Linnakangas, Tomas Vondra
Discussion: https://www.postgresql.org/message-id/VI1PR0701MB69603A433348EDCF783C6ECBF6EF0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
---
 doc/src/sgml/config.sgml |   8 ++-
 src/backend/access/heap/vacuumlazy.c |  18 +
 src/backend/commands/analyze.c   | 100 +--
 3 files changed, 116 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index a218d78bef..5e551b9f6d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7457,9 +7457,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         I/O timing information is
         displayed in <link linkend="monitoring-pg-stat-database-view">
         <structname>pg_stat_database</structname></link>, in the output of
-        <xref linkend="sql-explain"/> when the <literal>BUFFERS</literal> option is
-        used, and by <xref linkend="pgstatstatements"/>.  Only superusers can
-        change this setting.
+        <xref linkend="sql-explain"/> when the <literal>BUFFERS</literal> option
+        is used, by autovacuum for auto-vacuums and auto-analyzes, when
+        <xref linkend="guc-log-autovacuum-min-duration"/> is set and by
+        <xref linkend="pgstatstatements"/>.  Only superusers can change this
+        setting.

   
  
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a65dcbebfa..3d01172837 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -441,6 +441,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	TransactionId new_frozen_xid;
 	MultiXactId new_min_multi;
 	ErrorContextCallback errcallback;
+	PgStat_Counter startreadtime = 0;
+	PgSt

Re: automatic analyze: readahead - add "IO read time" log message

2021-03-11 Thread Tomas Vondra



On 3/12/21 1:11 AM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
>> On 3/8/21 8:42 PM, Stephen Frost wrote:
>>> * Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
 On 2/10/21 11:10 PM, Stephen Frost wrote:
> * Heikki Linnakangas (hlinn...@iki.fi) wrote:
>> On 05/02/2021 23:22, Stephen Frost wrote:
>>> Unless there's anything else on this, I'll commit these sometime next
>>> week.
>>
>> One more thing: Instead of using 'effective_io_concurrency' GUC directly,
>> should call get_tablespace_maintenance_io_concurrency().
>
> Ah, yeah, of course.
>
> Updated patch attached.

 A couple minor comments:

 1) I think the patch should be split into two parts, one adding the
 track_io_timing, one adding the prefetching.
>>>
>>> This was already done..
>>
>> Not sure what you mean by "done"? I see the patch still does both
>> changes related to track_io_timing and prefetching.
> 
> They are two patches..
> 
> ➜  ~ grep Subject analyze_prefetch_v8.patch 
> Subject: [PATCH 1/2] Improve logging of auto-vacuum and auto-analyze
> Subject: [PATCH 2/2] Use pre-fetching for ANALYZE
> 
> The first doesn't have any prefetch-related things, the second doesn't
> have any track_io_timing things..
> 

Oh! I didn't realize a single file can contain multiple separate
patches, so I saw a single file and assumed it's one patch. How do you
produce a single file, and is that better than just generating multiple
files using git format-patch?

 3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows?
>>>
>>> Perhaps..
>>>
 This makes the code rather hard to read, IMHO. It seems to me we can
 move the code around a bit and merge some of the #ifdef blocks - see the
 attached patch. Most of this is fairly trivial, with the exception of
 moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this
 does not materially change the behavior, but perhaps I'm wrong.
>>>
>>> but I don't particularly like doing the prefetch right before we run
>>> vacuum_delay_point() and potentially sleep.
>>
>> Why? Is that just a matter of personal preference (fair enough) or is
>> there a reason why that would be wrong/harmful?
> 
> Telling the kernel "hey, we're about to need this, please go get it" and
> then immediately going to sleep just strikes me as a bit off.  We should
> be trying to minimize the time between prefetch and actual request for
> the page.  Of course, there's going to be some times that we will issue
> a prefetch and then come around again and end up hitting the limit and
> sleeping before we actually request the page and maybe it doesn't
> ultimately matter much but it just seems better to sleep first before
> issuing the prefetch to minimize the amount 'outstanding' when we do end
> up sleeping.  One thing about prefetching is that the kernel is
> certainly within its rights to decide to drop the page before we
> actually go to read it, if it's under pressure and we're just sleeping.
> 

I don't know, but this argument seems rather hand-wavy to me, TBH.

Firstly, we're prefetching quite far ahead anyway, so there's almost
always going to be a vacuum_delay_point() call between you issue the
prefetch and actually using the page. Secondly, vacuum_delay_point won't
really sleep for most of the calls (assuming the cost limit is not
unreasonably small).

The way I see it (a) it's unlikely a particular vacuum_delay_point()
call will sleep, while at the same time (b) it's quite probable one of
the calls between prefetching and using the page will sleep.

So I think the exact ordering of those two calls does not really matter,
at least not for this particular reason.


That being said, I don't feel strongly about this - if you prefer it
like this, so be it. I can live with two ifdefs instead of one.

>> I think e.g. prefetch_targblock could be moved to the next #ifdef, which
>> will eliminate the one-line ifdef.
> 
> Sure, done in the attached.
> 
> Thanks for the review!  Unless there's other comments, I'll plan to push
> this over the weekend or early next week.
> 

+1


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: automatic analyze: readahead - add "IO read time" log message

2021-03-11 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> On 3/12/21 1:11 AM, Stephen Frost wrote:
> > * Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> >> On 3/8/21 8:42 PM, Stephen Frost wrote:
> >>> * Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
>  On 2/10/21 11:10 PM, Stephen Frost wrote:
> > * Heikki Linnakangas (hlinn...@iki.fi) wrote:
> >> On 05/02/2021 23:22, Stephen Frost wrote:
> >>> Unless there's anything else on this, I'll commit these sometime next
> >>> week.
> >>
> >> One more thing: Instead of using 'effective_io_concurrency' GUC 
> >> directly,
> >> should call get_tablespace_maintenance_io_concurrency().
> >
> > Ah, yeah, of course.
> >
> > Updated patch attached.
> 
>  A couple minor comments:
> 
>  1) I think the patch should be split into two parts, one adding the
>  track_io_timing, one adding the prefetching.
> >>>
> >>> This was already done..
> >>
> >> Not sure what you mean by "done"? I see the patch still does both
> >> changes related to track_io_timing and prefetching.
> > 
> > They are two patches..
> > 
> > ➜  ~ grep Subject analyze_prefetch_v8.patch 
> > Subject: [PATCH 1/2] Improve logging of auto-vacuum and auto-analyze
> > Subject: [PATCH 2/2] Use pre-fetching for ANALYZE
> > 
> > The first doesn't have any prefetch-related things, the second doesn't
> > have any track_io_timing things..
> 
> Oh! I didn't realize a single file can contain multiple separate
> patches, so I saw a single file and assumed it's one patch. How do you
> produce a single file, and is that better than just generating multiple
> files using git format-patch?

I use my gfp alias, which sends to stdout:

gfp='git format-patch @{u} --stdout'

So I only have one file to move around and there's no question about the
ordering of applying them and such..  Not sure if everyone would agree
that it's better or worse but I do find it works a bit better for me
than dealing with multiple files.

>  3) Is there a way to reduce the amount of #ifdef in acquire_sample_rows?
> >>>
> >>> Perhaps..
> >>>
>  This makes the code rather hard to read, IMHO. It seems to me we can
>  move the code around a bit and merge some of the #ifdef blocks - see the
>  attached patch. Most of this is fairly trivial, with the exception of
>  moving PrefetchBuffer before table_scan_analyze_next_block - AFAIK this
>  does not materially change the behavior, but perhaps I'm wrong.
> >>>
> >>> but I don't particularly like doing the prefetch right before we run
> >>> vacuum_delay_point() and potentially sleep.
> >>
> >> Why? Is that just a matter of personal preference (fair enough) or is
> >> there a reason why that would be wrong/harmful?
> > 
> > Telling the kernel "hey, we're about to need this, please go get it" and
> > then immediately going to sleep just strikes me as a bit off.  We should
> > be trying to minimize the time between prefetch and actual request for
> > the page.  Of course, there's going to be some times that we will issue
> > a prefetch and then come around again and end up hitting the limit and
> > sleeping before we actually request the page and maybe it doesn't
> > ultimately matter much but it just seems better to sleep first before
> > issuing the prefetch to minimize the amount 'outstanding' when we do end
> > up sleeping.  One thing about prefetching is that the kernel is
> > certainly within its rights to decide to drop the page before we
> > actually go to read it, if it's under pressure and we're just sleeping.
> 
> I don't know, but this argument seems rather hand-wavy to me, TBH.

I agree it is, but then again, the only argument raised against doing it
this way is that it might reduce the number of #ifdef's and, at least
for my part, that doesn't seem like a terribly strong argument in this
case.

> Firstly, we're prefetching quite far ahead anyway, so there's almost
> always going to be a vacuum_delay_point() call between you issue the
> prefetch and actually using the page. Secondly, vacuum_delay_point won't
> really sleep for most of the calls (assuming the cost limit is not
> unreasonably small).
> 
> The way I see it (a) it's unlikely a particular vacuum_delay_point()
> call will sleep, while at the same time (b) it's quite probable one of
> the calls between prefetching and using the page will sleep.

Doing the prefetch before the sleep sets us up in a position where
there'll always be one more page that's been prefetched while we're
sleeping, no matter when that sleep actually happens, than if we do the
prefetch after the sleep.  Considering the default's only 10, that's 10%
of the pages prefetched.

> So I think the exact ordering of those two calls does not really matter,
> at least not for this particular reason.

I don't see it as a huge deal, but..

> That being said, I don't feel strongly about this - if you prefer it
> like this, so be it. I can live with two ifdefs instead

Re: automatic analyze: readahead - add "IO read time" log message

2021-03-16 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> > >> I think e.g. prefetch_targblock could be moved to the next #ifdef, which
> > >> will eliminate the one-line ifdef.
> > > 
> > > Sure, done in the attached.
> > > 
> > > Thanks for the review!  Unless there's other comments, I'll plan to push
> > > this over the weekend or early next week.
> > 
> > +1
> 
> Thanks again for the review!

and pushed.

Thanks everyone for the suggestions and reviews, and to Jakub in
particular for starting us down this path towards improving things in
ANALYZE.

Stephen




Re: automatic analyze: readahead - add "IO read time" log message

2020-10-26 Thread Stephen Frost
Greetings,

* Jakub Wartak (jakub.war...@tomtom.com) wrote:
> I have, I hope, an interesting observation (and a nano patch proposal) from a 
> system where statistics freshness is a critical factor. Autovacuum/statistics 
> auto-gathering was tuned to be very aggressive:
> autovacuum_vacuum_cost_delay=0 (makes autovacuum_vacuum_cost_limit irrelevant)
> autovacuum_naptime=1s
> autovacuum_max_workers=4
> 
> Some critical table partitions are configured with:  
> autovacuum_analyze_scale_factor=0.001, autovacuum_analyze_threshold=5 to 
> force auto-analyze jobs pretty often. The interesting logs are like this:
> automatic analyze of table "t1" system usage: CPU: user: 37.52 s, system: 
> 23.01 s, elapsed: 252.14 s
> automatic analyze of table "t2" system usage: CPU: user: 38.70 s, system: 
> 22.63 s, elapsed: 317.33 s
> automatic analyze of table "t2" system usage: CPU: user: 39.38 s, system: 
> 21.43 s, elapsed: 213.58 s
> automatic analyze of table "t1" system usage: CPU: user: 37.91 s, system: 
> 24.49 s, elapsed: 229.45 s

That's certainly pretty aggressive. :)

> and this is the root cause of my question.  As you can see there is a huge
> 3x-4x discrepancy between "elapsed" and user+system time, which is strange,
> at least to me, as there should be no waiting
> (autovacuum_vacuum_cost_delay=0). According to various tools that is true:
> the time was spent somewhere else, but not in the PostgreSQL analyze
> itself. ps(1) and pidstat(1) also report the same for the worker:

The user/system time there is time-on-CPU (hence the 'CPU: ' prefix).

> 06:56:12 AM   PID%usr %system  %guest%CPU   CPU  Command
> 06:56:13 AM1147748.00   10.000.00   18.0018  postgres
> 06:56:14 AM1147748.00   11.000.00   19.0015  postgres
> 06:56:15 AM1147745.00   13.000.00   18.0018  postgres
> 
> 06:56:17 AM   PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
> 06:56:18 AM114774  63746.53  0.00  0.00  postgres
> 06:56:19 AM114774  62896.00  0.00  0.00  postgres
> 06:56:20 AM114774  62920.00  0.00  0.00  postgres
> 
> One could argue that such an autoanalyze worker could perform 5x more work
> (%CPU -> 100%) here. The I/O system is not doing much (total = 242MB/s
> reads@22k IOPS, 7MB/s writes@7k IOPS, with service time 0.04ms); although
> it reports high utilization, I'm pretty sure it could push much more. There
> can be up to 3x-4x such 70-80MB/s analyzes running in parallel (let's say
> 300MB/s for statistics collection alone).

The analyze is doing more-or-less random i/o since it's skipping through
the table picking out select blocks, not doing regular sequential i/o.

> According to various gdb backtraces, a lot of time is spent here:
> #0  0x7f98cdfc9f73 in __pread_nocancel () from /lib64/libpthread.so.0
> #1  0x00741a16 in pread (__offset=811253760, __nbytes=8192, 
> __buf=0x7f9413ab7280, __fd=) at /usr/include/bits/unistd.h:84
> #2  FileRead (file=, buffer=0x7f9413ab7280 "\037\005", 
> amount=8192, offset=811253760, wait_event_info=167772173) at fd.c:1883
> #3  0x00764b8f in mdread (reln=, forknum= out>, blocknum=19890902, buffer=0x7f9413ab7280 "\037\005") at md.c:596
> #4  0x0073d69b in ReadBuffer_common (smgr=, 
> relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=19890902, 
> mode=RBM_NORMAL, strategy=0x1102278, hit=0x7fffba7e2d4f)
> at bufmgr.c:897
> #5  0x0073e27e in ReadBufferExtended (reln=0x7f98c0c9ded0, 
> forkNum=MAIN_FORKNUM, blockNum=19890902, mode=, 
> strategy=) at bufmgr.c:665
> #6  0x004c2e2f in heapam_scan_analyze_next_block (scan= out>, blockno=, bstrategy=) at 
> heapam_handler.c:998
> #7  0x00597de1 in table_scan_analyze_next_block (bstrategy= out>, blockno=, scan=0x10c8098) at 
> ../../../src/include/access/tableam.h:1462
> #8  acquire_sample_rows (onerel=0x7f98c0c9ded0, elevel=13, rows=0x1342e08, 
> targrows=150, totalrows=0x7fffba7e3160, totaldeadrows=0x7fffba7e3158) at 
> analyze.c:1048
> #9  0x00596a50 in do_analyze_rel (onerel=0x7f98c0c9ded0, 
> params=0x10744e4, va_cols=0x0, acquirefunc=0x597ca0 , 
> relpages=26763525, inh=false,
> in_outer_xact=false, elevel=13) at analyze.c:502
> [..]
> #12 0x006e76b4 in autovacuum_do_vac_analyze (bstrategy=0x1102278, 
> tab=) at autovacuum.c:3118
> [..]

Sure, we're blocked on a read call trying to get the next block.

> The interesting thing is that targrows=1.5 million and that blocks are
> accessed (as expected) in sorted order:
> 
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
> blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
> blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
> blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
> blockno=19890935

Re: automatic analyze: readahead - add "IO read time" log message

2020-10-27 Thread Jakub Wartak
Hi Stephen, hackers,

> The analyze is doing more-or-less random i/o since it's skipping through
> the table picking out select blocks, not doing regular sequential i/o.
VS
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
>> blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
>> blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
>> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
>> blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
>Not really sure what's interesting here, but it does look like we're
>skipping through the table as expected.

Yes, but not randomly in this case. I wanted to point out that the block 
number is always incrementing, which is why I included this debug output: 
such a pattern might trigger the readahead heuristics.
Perhaps this depends on how the table was built / vacuumed? (In this case it 
is pure INSERT-only; I would expect the same in time-series DBs and DWHs.)

> With all those 'readahead' calls it certainly makes one wonder if the
> Linux kernel is reading more than just the block we're looking for
> because it thinks we're doing a sequential read and will therefore want
> the next few blocks when, in reality, we're going to skip past them,
> meaning that any readahead the kernel is doing is likely just wasted
> I/O.

I've done some quick & dirty tests with blockdev --setra/setfra 0, after 
spending time looking at the smgr/md/fd API changes that would be required to 
find a shortcut, but I'm actually getting slightly worse timings, at least in 
"laptop DB tests". One thing I've noticed is that this would need to apply 
only to automatic analyze, not to automatic vacuum, where apparently there is 
some boost due to readahead.

> That definitely seems like a useful thing to include and thanks for the
> patch!  Please be sure to register it in the commitfest app:
> https://commitfest.postgresql.org

Thank you! Thread is now registered.

> I would think that, ideally, we'd teach analyze.c to work in the same
> way that bitmap heap scans do- that is, use posix_fadvise to let the
> kernel know what pages we're going to want next instead of the kernel
> guessing (incorrectly) or not doing any pre-fetching.  I didn't spend a
> lot of time poking, but it doesn't look like analyze.c tries to do any
> prefetching today.  In a similar vein, I wonder if VACUUM should be
> doing prefetching too today, at least when it's skipping through the
> heap based on the visibility map and jumping over all-frozen pages.

My only idea would be that a lot of those blocks could be read asynchronously 
in batches (AIO), with POSIX_FADV_RANDOM issued on the block range beforehand, 
so maybe the optimization is possible, but not until we have AIO ;)

> Haven't looked too closely at this but in general +1 on the idea and
> this approach looks pretty reasonable to me.  Only thing I can think of
> off-hand is to check how it compares to other places where we report IO
> read time and make sure that it looks similar.

OK, I've changed the output in the 002 version to include "avg read rate", 
just like in the autovacuum case, while still maintaining single-line output, 
e.g.:
automatic analyze of table "test.public.t1_default" avg read rate: 96.917 MB/s 
(read time: 2.52 s), system usage: CPU: user: 0.28 s, system: 0.26 s, elapsed: 
2.94 s

-J.

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..38d69d887e 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -312,6 +312,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
+   PgStat_Counter startreadtime = 0;
+   long           startblksread = 0;
 
if (inh)
ereport(elevel,
@@ -347,6 +349,8 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
{
pg_rusage_init(&ru0);
+   startreadtime = pgStatBlockReadTime;
+   startblksread = pgBufferUsage.shared_blks_read;
if (params->log_min_duration > 0)
starttime = GetCurrentTimestamp();
}
@@ -685,12 +689,27 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
    if (params->log_min_duration == 0 ||
        TimestampDifferenceExceeds(starttime, GetCurrentTimestamp(),
                                   params->log_min_duration))
-       ereport(LOG,
-               (errmsg("automatic analyze of table \"%s.%s.%s\" system usage: %s",
-                       get_database_name(MyDatabaseId),
-                       get_namespace_name(RelationGetNam

Re: automatic analyze: readahead - add "IO read time" log message

2020-10-27 Thread Stephen Frost
Greetings Jakub,

* Jakub Wartak (jakub.war...@tomtom.com) wrote:
> > The analyze is doing more-or-less random i/o since it's skipping through
> > the table picking out select blocks, not doing regular sequential i/o.
> VS
> >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
> >> blockno=19890910, bstrategy=0x1102278) at heapam_handler.c:984
> >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
> >> blockno=19890912, bstrategy=0x1102278) at heapam_handler.c:984
> >> Breakpoint 1, heapam_scan_analyze_next_block (scan=0x10c8098, 
> >> blockno=19890922, bstrategy=0x1102278) at heapam_handler.c:984
> >Not really sure what's interesting here, but it does look like we're
> >skipping through the table as expected.
> 
> Yes, but not randomly in this case. I wanted to point out that the block
> number is always incrementing, which is why I included this debug output:
> such a pattern might trigger the readahead heuristics.

Sure, it's incrementing, but it's skipping- this is very similar to what
we do with Bitmap Heap Scans, and that makes it different from a typical
sequential scan.

> Perhaps this depends on how the table was built / vacuumed? (In this case
> it is pure INSERT-only; I would expect the same in time-series DBs and
> DWHs.)

No, this is how ANALYZE behaves and hasn't got much to do with how the
table was built or vacuumed.

> > With all those 'readahead' calls it certainly makes one wonder if the
> > Linux kernel is reading more than just the block we're looking for
> > because it thinks we're doing a sequential read and will therefore want
> > the next few blocks when, in reality, we're going to skip past them,
> > meaning that any readahead the kernel is doing is likely just wasted
> > I/O.
> 
> I've done some quick & dirty tests with blockdev --setra/setfra 0, after
> spending time looking at the smgr/md/fd API changes that would be required
> to find a shortcut, but I'm actually getting slightly worse timings, at
> least in "laptop DB tests". One thing I've noticed is that this would need
> to apply only to automatic analyze, not to automatic vacuum, where
> apparently there is some boost due to readahead.

Interesting that you weren't seeing any benefit to disabling readahead.
Were you able to see where the time in the kernel was going when
readahead was turned off for the ANALYZE?

Note that you shouldn't need to make changes to smgr/md/fd to leverage
posix_fadvise- what you would do is use PrefetchBuffer(), see
BitmapPrefetch().

The VACUUM case is going to be complicated by what's in the visibility
map.  A VACUUM that isn't able to skip any pages yet is certainly going
to benefit from the kernel's readahead, but a VACUUM that's able to skip
over pages likely wouldn't benefit as much.

The way to test this would look something like:

- COPY a bunch of data into a table
- make sure to commit that and any other ongoing transactions
- VACUUM FREEZE the table
- check the visibility map to make sure most of the pages are marked as
  all-frozen after the VACUUM FREEZE
- randomly UPDATE the table, to really get the effect, maybe update 20%
  of the pages while leaving the rest alone (and therefore 80% of the
  table should still have the all-frozen bit set in the visibility map)
- *then* do a VACUUM on the table and see what happens with different
  amounts of read-ahead (or, ideally, with posix_fadvise() being used to
  let the kernel know what pages we're going to actually want).

> > That definitely seems like a useful thing to include and thanks for the
> > patch!  Please be sure to register it in the commitfest app:
> > https://commitfest.postgresql.org
> 
> Thank you! Thread is now registered.

Great!

> > I would think that, ideally, we'd teach analyze.c to work in the same
> > way that bitmap heap scans do- that is, use posix_fadvise to let the
> > kernel know what pages we're going to want next instead of the kernel
> > guessing (incorrectly) or not doing any pre-fetching.  I didn't spend a
> > lot of time poking, but it doesn't look like analyze.c tries to do any
> > prefetching today.  In a similar vein, I wonder if VACUUM should be
> > doing prefetching too today, at least when it's skipping through the
> > heap based on the visibility map and jumping over all-frozen pages.
> 
> My only idea would be that a lot of those blocks could be read
> asynchronously in batches (AIO), with POSIX_FADV_RANDOM issued on the block
> range beforehand, so maybe the optimization is possible, but not until we
> have AIO ;)

Well, the idea is that posix_fadvise() usage through PrefetchBuffer()
gets us some of that by letting the kernel know what we're going to ask
for next.  AIO is a whole other animal that's been discussed off and on
around here but it's a much larger and more invasive change than just
calling posix_fadvise().

> > Haven't looked too closely at this but in general +1 on the idea and
> > this approach looks pretty reasonable to me.  Only thing I can think of
> > off-hand is to check how it compares

Re: automatic analyze: readahead - add "IO read time" log message

2020-11-03 Thread Jakub Wartak
Hi Stephen, hackers,

>> > With all those 'readahead' calls it certainly makes one wonder if the
>> > Linux kernel is reading more than just the block we're looking for
>> > because it thinks we're doing a sequential read and will therefore want
>> > the next few blocks when, in reality, we're going to skip past them,
>> > meaning that any readahead the kernel is doing is likely just wasted
>> > I/O.
>> I've done some quick & dirty tests with blockdev --setra/setfra 0, after
>> spending time looking at the smgr/md/fd API changes that would be required
>> to find a shortcut, but I'm actually getting slightly worse timings, at
>> least in "laptop DB tests". One thing I've noticed is that this would need
>> to apply only to automatic analyze, not to automatic vacuum, where
>> apparently there is some boost due to readahead.

>Interesting that you weren't seeing any benefit to disabling readahead.

I've got some free minutes and have repeated the exercise in a more realistic 
and strict environment than the previous one, to conclude that the current 
situation is preferable:

The analyzed table was 171GB (as reported by \dt+), plus indexes: 35GB, 147GB, 
35GB, 65GB (as reported by \di+).
Linux kernel 4.14.x, 2x NVMe under dm-0 (it might matter, as /dev/dm-0 is a 
different layer and might have different storage settings), VG on top of dm-0, 
LV with stripe size 8kB, ext4.
s_b=128MB, RAM=128GB (minus ~30GB which were reserved for HugePages), typical 
output of PgSQL 12:
INFO:  "x": scanned 150 of 22395442 pages, containing 112410444 live rows 
and 0 dead rows; 150 rows in sample, 1678321053 estimated total rows

Hot VFS cache:
Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is spent 
heapam_scan_analyze_next_block() -> .. -> pread() which causes ~70..80MB/s as 
reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat with average 
request size per IO=25 sectors(*512/1024 = ~12kB), readahead on, hot caches, 
total elapsed ~3m
Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
Run3: Defaults, similar as above (hot VFS cache), total elapsed 2m:40s

No VFS cache:
Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s 
read, ondemand_readahead visible in perf, total elapsed 3m30s
Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 
3m29s
Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total elapsed 
3m28s

No VFS cache, readahead off:
Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: 
reads at 33MB/s, ~13% CPU, 8.7k read IOPS @ avgrq-sz = 11 sectors (*512=5.5kB), 
total elapsed 5m59s
Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: 
as above, double-confirmed no readaheads [ 
pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total 
elapsed 5m56s
Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 /dev/dm-0: 
as above, total elapsed 5m55s

One thing that is not clear here, and maybe worth measuring in the future, is 
how striped LVs are affected by readahead.

>Were you able to see where the time in the kernel was going when
>readahead was turned off for the ANALYZE?

Yes, my interpretation is that the time is spent waiting directly in the 
block I/O layer.

54.67%  1.33%  postgres  postgres  [.] FileRead
        ---FileRead
           --53.33%--__pread_nocancel
              --50.67%--entry_SYSCALL_64_after_hwframe
                        do_syscall_64
                        sys_pread64
                        |--49.33%--vfs_read
                        |          --48.00%--__vfs_read
                        |             |--45.33%--generic_file_read_iter
                        |             |   |--42.67%--ondemand_readahead
                        |             |   |   __do_page_cache_readahead
                        |             |   |   |--25.33%--ext4_mpage_readpages
                        |             |   |   |   |--10.67%--submit_bio
                        |             |   |   |   |   generic_make_request
                        |             |   |   |   |   |--8.00%--blk_mq_make_request
                        |             |   |   |   |   |   |--4.00%--blk_mq_get_request
                        |             |   |   |   |   |   |   |--1.33%--blk_mq_get_tag

Re: automatic analyze: readahead - add "IO read time" log message

2020-11-03 Thread Stephen Frost
Greetings,

* Jakub Wartak (jakub.war...@tomtom.com) wrote:
> >Interesting that you weren't seeing any benefit to disabling readahead.
> 
> I've got some free minutes and have repeated the exercise in a more
> realistic and strict environment than the previous one, to conclude that
> the current situation is preferable:

Thanks for spending time on this!

> The analyzed table was 171GB (as reported by \dt+), plus indexes: 35GB,
> 147GB, 35GB, 65GB (as reported by \di+).
> Linux kernel 4.14.x, 2x NVMe under dm-0 (it might matter, as /dev/dm-0 is
> a different layer and might have different storage settings), VG on top of
> dm-0, LV with stripe size 8kB, ext4.
> s_b=128MB, RAM=128GB (minus ~30GB which were reserved for HugePages),
> typical output of PgSQL 12:
> INFO:  "x": scanned 150 of 22395442 pages, containing 112410444 live rows 
> and 0 dead rows; 150 rows in sample, 1678321053 estimated total rows
> 
> Hot VFS cache:
> Run0: Defaults, default RA on dm-1=256 (*512=128kB), most of the time is 
> spent heapam_scan_analyze_next_block() -> .. -> pread() which causes 
> ~70..80MB/s as reported by pidstat, maximum 22-25% CPU, ~8k IOPS in iostat 
> with average request size per IO=25 sectors(*512/1024 = ~12kB), readahead on, 
> hot caches, total elapsed ~3m
> Run1: Defaults, similar as above (hot VFS cache), total elapsed 2m:50s
> Run2: Defaults, similar as above (hot VFS cache), total elapsed 2m:42s
> Run3: Defaults, similar as above (hot VFS cache), total elapsed 2m:40s
> 
> No VFS cache:
> Run4: echo 3 > drop_caches, still with s_b=128MB: maximum 18-23% CPU, ~70MB/s 
> read, ondemand_readahead visible in perf, total elapsed 3m30s
> Run5: echo 3 > drop_caches, still with s_b=128MB: same as above, total 
> elapsed 3m29s
> Run6: echo 3 > drop_caches, still with s_b=128MB: same as above, total 
> elapsed 3m28s
> 
> No VFS cache, readahead off:
> Run7: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 
> /dev/dm-0: reads at 33MB/s, ~13% CPU, 8.7k read IOPS @ avgrq-sz = 11 sectors 
> (*512=5.5kB), total elapsed 5m59s
> Run8: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 
> /dev/dm-0: as above, double-confirmed no readaheads [ 
> pread()->generic_file_read_iter()->ext4_mpage_readpages()-> bio.. ], total 
> elapsed 5m56s
> Run9: echo 3 > drop_caches, still with s_b=128MB, blockdev --setra 0 
> /dev/dm-0: as above, total elapsed 5m55s

[ ... ]

> >The VACUUM case is going to be complicated by what's in the visibility
> >map. (..) 
> 
> After observing the ANALYZE readahead behavior benefit I've abandoned
> the case of testing much more advanced VACUUM processing, clearly Linux 
> read-ahead is beneficial in even simple cases.

This seems to be indicating that while the Linux kernel may end up
reading pages we don't end up needing, it's much more often the case
that it's ending up reading *some* pages that we do need, and that's
happening often enough that it more than makes up for the extra reads
being done.

Instead of having these guessing games between the kernel and what PG's
doing, however, we could potentially do better using posix_fadvise() to
tell the kernel, up front, exactly what blocks we are going to ask for,
and perhaps that would end up improving things.

Attached is a very much rough-n-ready patch for doing that, using
effective_io_concurrency to control how many blocks to pre-fetch for
ANALYZE (0 meaning 'none').  If you've got a chance to test with
different settings for effective_io_concurrency with the patch applied
to see what impact posix_fadvise() has on these ANALYZE runs, that would
be very cool to see.

Going between effective_io_concurrency = 0 and effective_io_concurrency = 10
with this patch, in some very quick testing on a laptop NVMe, while
making sure to drop caches and restart PG in between to clear out
shared_buffers, definitely shows that prefetching done this way is an
improvement over letting the kernel's normal read ahead handle it.

> >> My only idea would be that a lot of those blocks could be read 
> >> asynchronously in batches (AIO) with POSIX_FADV_RANDOM issued on 
> >> block-range before, so maybe the the optimization is possible, but not 
> >> until we'll have AIO ;)
> >
> > (..)AIO is a whole other animal that's been discussed off and on
> >around here but it's a much larger and more invasive change than just
> >calling posix_fadvise().
> 
> Yes, I'm aware and I'm keeping my fingers crossed that maybe some day 

I don't think we should throw out the idea of using PrefetchBuffer()
here.  "Real" AIO would certainly be good to have one of these days, but
until then, posix_fadvise() could net us some of those gains in the
meantime.

> The ANALYZE just seem fit to be natural candidate to use it. The only easy 
> chance 
> of acceleration of stats gathering - at least to me and enduser point of view 
> - 
> is to have more parallel autoanalyze workers running to drive more I/O 
> concurrency 
> (by e.g. partitioning the data),  both in readahe

Re: automatic analyze: readahead - add "IO read time" log message

2020-11-04 Thread Jakub Wartak

Greetings Stephen,

I saw up to 410MB/s for a few seconds with this patch on NVMe, and that's a 
huge ~5.2x improvement, which is amazing for such a simple patch.

The system and data were identical to last time, so the results are directly 
comparable to the previous post. The only change is that I've applied your 
patch on top of REL12_STABLE, as this is the cluster where I had the original 
data. I think it deserves its own commitfest entry, right? Or do we combine 
those two?

Each test was conducted with 
1. alter system set effective_io_concurrency=$N; 
2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 
restart;
3. analyze verbose t;
4. the measurements were observed by pidstat -d and iostat -dxm.

no readahead (blockdev --setra 0), cold cache:
effective_io_concurrency=64; => ~26MB/s
effective_io_concurrency=32; => ~29MB/s
effective_io_concurrency=8; => ~31MB/s
effective_io_concurrency=4; => ~32MB/s
effective_io_concurrency=1; => ~32MB/s
effective_io_concurrency=0; => ~32MB/s

readahead=256 (128kB, Linux defaults), cold cache:
effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 22s 
, baseline)
effective_io_concurrency=1; => ~62MB/s
effective_io_concurrency=4; => ~370MB/s
effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU 
util 100%
effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s)

I think one needs to be sure to restart, so that the autovacuum 
workers pick up the new effective_io_concurrency; maybe that's a 
documentation thing only.

I have just one question, with this patch the strace I/O of analyze 
look like below:

pread(72, ..., 8192, 738181120) = 8192
fadvise64(72, 738500608, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, "..., 8192, 738500608) = 8192
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, .., 8192, 738607104) = 8192
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, ..., 8192, 738754560) = 8192
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0

If you highlight "738754560" in the output, it appears that the same offset 
keeps being advised until it is finally pread() - in the case of offset 
"738754560" it was asked for 3 times. Also, I wouldn't have imagined in my 
wildest dreams that posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap 
syscall.

-J.



Re: automatic analyze: readahead - add "IO read time" log message

2020-11-04 Thread Tomas Vondra

On Wed, Nov 04, 2020 at 09:07:59AM +, Jakub Wartak wrote:


Greetings Stephen,

I saw up to 410MB/s for a few seconds with this patch on NVMe, and that's a
huge ~5.2x improvement, which is amazing for such a simple patch.

The system and data were identical to last time, so results are directly 
comparable to the previous post. The only change is that I've applied your 
patch on top of REL12_STABLE, as this is the cluster where I had the original 
data. I think it deserves its own commitfest entry, right? Or do we combine 
those two?

Each test was conducted with
1. alter system set effective_io_concurrency=$N;
2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 
restart;
3. analyze verbose t;
4. the measurements were observed by pidstat -d and iostat -dxm.

no readahead (blockdev --setra 0), cold cache:
effective_io_concurrency=64; => ~26MB/s
effective_io_concurrency=32; => ~29MB/s
effective_io_concurrency=8; => ~31MB/s
effective_io_concurrency=4; => ~32MB/s
effective_io_concurrency=1; => ~32MB/s
effective_io_concurrency=0; => ~32MB/s

readahead=256 (128kB, Linux defaults), cold cache:
effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 22s 
, baseline)
effective_io_concurrency=1; => ~62MB/s
effective_io_concurrency=4; => ~370MB/s
effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, CPU 
util 100%
effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time ~39s)



Thanks for the testing, those are indeed quite promising numbers for
such a simple patch. I wonder how would this behave on a different
storage - I'll see if I can do some testing, but I guess it should be
the same as for other places that do prefetching.


I think one needs to be sure to restart, so that autovacuum
workers get the new effective_io_concurrency, maybe that's a
documentation thing only.

I have just one question, with this patch the strace I/O of analyze
look like below:

pread(72, ..., 8192, 738181120) = 8192
fadvise64(72, 738500608, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, "..., 8192, 738500608) = 8192
fadvise64(72, 738607104, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, .., 8192, 738607104) = 8192
fadvise64(72, 738754560, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0
pread(72, ..., 8192, 738754560) = 8192
fadvise64(72, 738762752, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739115008, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739155968, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739385344, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739696640, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 739934208, 8192, POSIX_FADV_WILLNEED) = 0
fadvise64(72, 740122624, 8192, POSIX_FADV_WILLNEED) = 0

If you highlight "738754560" in the output, you can see the same offset
being re-issued in each batch of syscalls until it is finally pread() - in
the case of offset "738754560" it was requested three times. Also, I
wouldn't have imagined in my wildest dreams that
posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
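As an aside on just how cheap the hint is: posix_fadvise(POSIX_FADV_WILLNEED) only queues readahead and returns immediately, without waiting for any I/O. A minimal standalone sketch (not PostgreSQL code; the helper names are invented here, and the 8kB block size merely mirrors the trace above):

```c
#define _XOPEN_SOURCE 600
#include <sys/types.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Issue POSIX_FADV_WILLNEED for each 8kB block in `offsets`, the same
 * pattern the strace above shows.  posix_fadvise() only queues readahead
 * and returns at once (0 on success, an errno value on failure), which is
 * why it is so much cheaper than a blocking pread().
 */
int
hint_blocks(int fd, const off_t *offsets, int nblocks)
{
    for (int i = 0; i < nblocks; i++)
    {
        int rc = posix_fadvise(fd, offsets[i], 8192, POSIX_FADV_WILLNEED);

        if (rc != 0)
            return rc;
    }
    return 0;
}

/* Test helper: an anonymous temp file of the given size. */
int
make_temp_file(off_t size)
{
    char path[] = "/tmp/fadv_demo_XXXXXX";
    int  fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);               /* removed from disk once closed */
    if (ftruncate(fd, size) != 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}
```

Note the hint is purely advisory: the kernel may ignore it entirely (which is likely what happens in the readahead-disabled results later in this thread).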



IMHO that's a bug in the patch, which always tries to prefetch all
"future" blocks, including those that were already prefetched. It
probably needs to do something like bitmap heap scan where we track
what was already prefetched and only issue the new blocks.
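The fix described here, tracking what was already prefetched so only new blocks are issued, can be sketched outside PostgreSQL as a second cursor kept a fixed window ahead of the read cursor. The function and type names below are illustrative, not the patch's:

```c
#include <stddef.h>

typedef void (*prefetch_fn) (int blocknum);

/*
 * Illustrative sketch (not the actual patch): walk a sorted list of block
 * numbers while a second cursor stays up to `window` entries ahead, so
 * every block is hinted exactly once, unlike the buggy version that
 * re-hinted the whole remaining tail before every read.
 */
void
scan_with_prefetch(const int *blocks, int nblocks, int window,
                   prefetch_fn prefetch)
{
    int next_prefetch = 0;

    /* prime the window before the first read */
    while (next_prefetch < nblocks && next_prefetch < window)
        prefetch(blocks[next_prefetch++]);

    for (int i = 0; i < nblocks; i++)
    {
        /* ... read blocks[i] here ... */

        /* then top the window back up by exactly one block */
        if (next_prefetch < nblocks)
            prefetch(blocks[next_prefetch++]);
    }
}
```

With window set to effective_io_concurrency, this issues each block's hint exactly once, roughly the shape the bitmap heap scan prefetcher uses.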

BTW it seems your e-mail client does something funny, stripping the
"references" headers, which breaks threading and makes following the
discussion very hard (and will confuse the CF app too). If you look at
pgsql-hackers archives, each of your responses starts a new thread.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: automatic analyze: readahead - add "IO read time" log message

2020-11-04 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Wed, Nov 04, 2020 at 09:07:59AM +, Jakub Wartak wrote:
> >I saw up to 410MB/s for a few seconds with this patch on NVMe, and that's
> >a huge ~5.2x improvement which is amazing for such a simple patch.

Nice!

> >The system and data was identical like last time, so results are directly 
> >comparable
> >to the previous post. The only change is that I've applied your patch on
> >top of REL12_STABLE as this is the cluster where I had the original data.
> >I think it deserves its own commitfest entry, right? Or do we combine
> >those two?
> >
> >Each test was conducted with
> >1. alter system set effective_io_concurrency=$N;
> >2. echo 3 > /proc/sys/vm/drop_caches; 3x sync; /etc/init.d/postgresql-12 
> >restart;
> >3. analyze verbose t;
> >4. the measurements were observed by pidstat -d and iostat -dxm.
> >
> >no readahead (blockdev --setra 0), cold cache:
> >effective_io_concurrency=64; => ~26MB/s
> >effective_io_concurrency=32; => ~29MB/s
> >effective_io_concurrency=8; => ~31MB/s
> >effective_io_concurrency=4; => ~32MB/s
> >effective_io_concurrency=1; => ~32MB/s
> >effective_io_concurrency=0; => ~32MB/s

I'm speculating here, but it seems likely that by disabling read-ahead,
the posix_fadvise() calls are basically getting ignored and that's why
there's not much difference here...?

> >readahead=256 (128kB, Linux defaults), cold cache:
> >effective_io_concurrency=0; => ~60..68MB/s (analyze total elapsed time 3min 
> >22s , baseline)
> >effective_io_concurrency=1; => ~62MB/s
> >effective_io_concurrency=4; => ~370MB/s
> >effective_io_concurrency=8; => ~380..410MB/s <-- wow! up to 52k read IOPS, 
> >CPU util 100%
> >effective_io_concurrency=16; => ~320..390MB/s (analyze total elapsed time 
> >~39s)

Very nice.

> Thanks for the testing, those are indeed quite promising numbers for
> such a simple patch. I wonder how would this behave on a different
> storage - I'll see if I can do some testing, but I guess it should be
> the same as for other places that do prefetching.

That would certainly be cool.

> >I think one needs to be sure to restart, so that autovacuum
> >workers get the new effective_io_concurrency, maybe that's a
> >documentation thing only.
> >
> >I have just one question, with this patch the strace I/O of analyze
> >look like below:
> >
> >pread(72, ..., 8192, 738181120) = 8192
> >[... long run of fadvise64(..., POSIX_FADV_WILLNEED) and pread() calls
> >quoted in full above ...]
> >
> >If you highlight "738754560" in the output it appears to duplicate the
> >syscalls issued until it preads() - in case of "738754560" offset it was
> >asked for 3 times. Also I wouldn't  imagine in wildest dreams that
> >posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
> 
> IMHO that'a a bug in the patch, which always tries to prefetch all
> "future" blocks, including those that were already prefetched. It
> probably needs to do something like bitmap heap scan where we track
> what was already prefetched and only issue the new blocks.

Yeah, this was just something quickly thrown together to see if it'd
help.  I'll 

Re: automatic analyze: readahead - add "IO read time" log message

2020-11-04 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> >If you highlight "738754560" in the output it appears to duplicate the
> >syscalls issued until it preads() - in case of "738754560" offset it was
> >asked for 3 times. Also I wouldn't  imagine in wildest dreams that
> >posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
> 
> IMHO that'a a bug in the patch, which always tries to prefetch all
> "future" blocks, including those that were already prefetched. It
> probably needs to do something like bitmap heap scan where we track
> what was already prefetched and only issue the new blocks.

Updated patch attached which:

- Starts out by pre-fetching the first effective_io_concurrency number
  of blocks we are going to want, hopefully making it so the kernel will
  trust our fadvise's over its own read-ahead, right from the start.
- Makes sure the prefetch iterator is pushed forward whenever the
  regular iterator is moved forward.
- After each page read, issues a prefetch, similar to BitmapHeapScan, to
  hopefully avoid having the prefetching get in the way of the
  regular i/o.
- Added some comments, ran pgindent, added a commit message.

I do think we should also include patch that Jakub wrote previously
which adds information about the read rate of ANALYZE.

I'll look at integrating that into this patch and then look at a new
patch to do something similar for VACUUM in a bit.

If you're doing further benchmarking of ANALYZE though, this would
probably be the better patch to use.  Certainly improved performance
here quite a bit with effective_io_concurrency set to 16.

Thanks,

Stephen
From b154e180718b4b57433093719a43ee6310d1da6f Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Wed, 4 Nov 2020 10:46:23 -0500
Subject: [PATCH] Use pre-fetching for ANALYZE

When we have posix_fadvise() available, we can improve the performance
of an ANALYZE by quite a bit by using it to inform the kernel of the
blocks that we're going to be asking for.

Discussion: https://www.postgresql.org/message-id/VI1PR0701MB69603A433348EDCF783C6ECBF6EF0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
---
 src/backend/commands/analyze.c | 60 +-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8af12b5c6b..bee3453fe4 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1031,6 +1031,7 @@ acquire_sample_rows(Relation onerel, int elevel,
 	double		liverows = 0;	/* # live rows seen */
 	double		deadrows = 0;	/* # dead rows seen */
 	double		rowstoskip = -1;	/* -1 means not set yet */
+	long		randseed;		/* Seed for block sampler(s) */
 	BlockNumber totalblocks;
 	TransactionId OldestXmin;
 	BlockSamplerData bs;
@@ -1039,6 +1040,9 @@ acquire_sample_rows(Relation onerel, int elevel,
 	TableScanDesc scan;
 	BlockNumber nblocks;
 	BlockNumber blksdone = 0;
+#ifdef USE_PREFETCH
+	BlockSamplerData prefetch_bs;
+#endif
 
 	Assert(targrows > 0);
 
@@ -1048,7 +1052,13 @@ acquire_sample_rows(Relation onerel, int elevel,
 	OldestXmin = GetOldestNonRemovableTransactionId(onerel);
 
 	/* Prepare for sampling block numbers */
-	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, random());
+	randseed = random();
+	nblocks = BlockSampler_Init(&bs, totalblocks, targrows, randseed);
+
+#ifdef USE_PREFETCH
+	/* Create another BlockSampler, using the same seed, for prefetching */
+	(void) BlockSampler_Init(&prefetch_bs, totalblocks, targrows, randseed);
+#endif
 
 	/* Report sampling block numbers */
 	pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_TOTAL,
@@ -1060,16 +1070,64 @@ acquire_sample_rows(Relation onerel, int elevel,
 	scan = table_beginscan_analyze(onerel);
 	slot = table_slot_create(onerel, NULL);
 
+#ifdef USE_PREFETCH
+
+	/*
+	 * If we are doing prefetching, then go ahead and tell the kernel about
+	 * the first set of pages we are going to want.  This also moves our
+	 * iterator out ahead of the main one being used, where we will keep it so
+	 * that we're always pre-fetching out effective_io_concurrency number of
+	 * blocks ahead.
+	 */
+	if (effective_io_concurrency)
+	{
+		for (int i = 0; i < effective_io_concurrency; i++)
+		{
+			BlockNumber prefetch_block;
+
+			if (!BlockSampler_HasMore(&prefetch_bs))
+break;
+
+			prefetch_block = BlockSampler_Next(&prefetch_bs);
+			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_block);
+		}
+	}
+#endif
+
 	/* Outer loop over blocks to sample */
 	while (BlockSampler_HasMore(&bs))
 	{
+#ifdef USE_PREFETCH
+		BlockNumber prefetch_targblock = InvalidBlockNumber;
+#endif
 		BlockNumber targblock = BlockSampler_Next(&bs);
 
+#ifdef USE_PREFETCH
+
+		/*
+		 * Make sure that every time the main BlockSampler is moved forward
+		 * that our prefetch BlockSampler also gets moved forward, so that we
+		 * always stay out ahead.
+		 */
+		if (BlockSampler_HasMore(&prefetch_bs))
+			prefetch_targblock = BlockSampler_Next(&pre

Re: automatic analyze: readahead - add "IO read time" log message

2020-11-04 Thread Tomas Vondra

Hi,

On 11/4/20 5:02 PM, Stephen Frost wrote:

Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:

If you highlight "738754560" in the output it appears to duplicate the
syscalls issued until it preads() - in case of "738754560" offset it was
asked for 3 times. Also I wouldn't  imagine in wildest dreams that
posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.


IMHO that'a a bug in the patch, which always tries to prefetch all
"future" blocks, including those that were already prefetched. It
probably needs to do something like bitmap heap scan where we track
what was already prefetched and only issue the new blocks.


Updated patch attached which:

- Starts out by pre-fetching the first effective_io_concurrency number
   of blocks we are going to want, hopefully making it so the kernel will
   trust our fadvise's over its own read-ahead, right from the start.
- Makes sure the prefetch iterator is pushed forward whenever the
   regular interator is moved forward.
- After each page read, issues a prefetch, similar to BitmapHeapScan, to
   hopefully avoiding having the prefetching get in the way of the
   regular i/o.
- Added some comments, ran pgindent, added a commit message.



Nice, that was quick ;-)


I do think we should also include patch that Jakub wrote previously
which adds information about the read rate of ANALYZE.



+1


I'll look at integrating that into this patch and then look at a new
patch to do something similar for VACUUM in a bit.



+1


If you're doing further benchmarking of ANALYZE though, this would
probably be the better patch to use.  Certainly improved performance
here quite a bit with effective_io_concurrency set to 16.



Yeah. I'd expect this to be heavily dependent on hardware.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: automatic analyze: readahead - add "IO read time" log message

2020-11-09 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 11/4/20 5:02 PM, Stephen Frost wrote:
> >* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> >>>If you highlight "738754560" in the output it appears to duplicate the
> >>>syscalls issued until it preads() - in case of "738754560" offset it was
> >>>asked for 3 times. Also I wouldn't  imagine in wildest dreams that
> >>>posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.
> >>
> >>IMHO that'a a bug in the patch, which always tries to prefetch all
> >>"future" blocks, including those that were already prefetched. It
> >>probably needs to do something like bitmap heap scan where we track
> >>what was already prefetched and only issue the new blocks.
> >
> >Updated patch attached which:
> >
> >- Starts out by pre-fetching the first effective_io_concurrency number
> >   of blocks we are going to want, hopefully making it so the kernel will
> >   trust our fadvise's over its own read-ahead, right from the start.
> >- Makes sure the prefetch iterator is pushed forward whenever the
> >   regular interator is moved forward.
> >- After each page read, issues a prefetch, similar to BitmapHeapScan, to
> >   hopefully avoiding having the prefetching get in the way of the
> >   regular i/o.
> >- Added some comments, ran pgindent, added a commit message.
> 
> Nice, that was quick ;-)

:)

> >I do think we should also include patch that Jakub wrote previously
> >which adds information about the read rate of ANALYZE.
> 
> +1

Attached is an updated patch which updates the documentation and
integrates Jakub's initial work on improving the logging around
auto-analyze (and I made the logging in auto-vacuum more-or-less match
it).

> >I'll look at integrating that into this patch and then look at a new
> >patch to do something similar for VACUUM in a bit.
> 
> +1

I spent some time looking into this but it's a bit complicated..  For
some sound reasons, VACUUM will avoid skipping through a table when
there's only a few pages that it could skip (not skipping allows us to
move forward the relfrozenxid).  That said, perhaps we could start doing
prefetching once we've decided that we're skipping.  We'd need to think
about if we have to worry about the VM changing between the pre-fetching
and the time when we're actually going to ask for the page..  I don't
*think* that's an issue because only VACUUM would be changing the pages
to be all-frozen or all-visible, and so if we see a page that isn't one
of those then we're going to want to visit that page and that's not
going to change, and we probably don't need to care about a page that
used to be all-frozen and now isn't during this run- but if the prefetch
went ahead and got page 10, and now page 8 is not-all-frozen and the
actual scan is at page 5, then maybe it wants page 8 next and that isn't
what we pre-fetched...

Anyhow, all-in-all, definitely more complicated and probably best
considered and discussed independently.
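A purely hypothetical, heavily simplified sketch of that idea (none of these names exist in PostgreSQL): collect prefetch targets from a snapshot of the all-frozen bits, accepting that a stale snapshot costs at most a wasted hint or an ordinary synchronous read, never a correctness problem:

```c
#include <stdbool.h>

/*
 * Hypothetical sketch (far simpler than real VACUUM): given a snapshot of
 * the "all frozen" bits, collect up to `window` page numbers, starting at
 * `start`, that the scan will actually visit, so only those get
 * prefetched.  If a bit flips after the snapshot is taken, we merely
 * prefetch a page we end up skipping (wasted I/O) or fail to prefetch one
 * we do visit (a normal synchronous read); neither affects correctness.
 */
int
collect_prefetch_targets(const bool *all_frozen, int npages,
                         int start, int window, int *out)
{
    int n = 0;

    for (int p = start; p < npages && n < window; p++)
    {
        if (!all_frozen[p])     /* page still needs visiting */
            out[n++] = p;
    }
    return n;
}
```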

> >If you're doing further benchmarking of ANALYZE though, this would
> >probably be the better patch to use.  Certainly improved performance
> >here quite a bit with effective_io_concurrency set to 16.
> 
> Yeah. I'd expect this to be heavily dependent on hardware.

Sure, I agree with that too.

Thanks,

Stephen
From 59ace10ce767b08d3a175322bcabed2fc9010054 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Wed, 4 Nov 2020 10:46:23 -0500
Subject: [PATCH] Use pre-fetching for ANALYZE and improve AV logging

When we have posix_fadvise() available, we can improve the performance
of an ANALYZE by quite a bit by using it to inform the kernel of the
blocks that we're going to be asking for.  Similar to bitmap index
scans, the number of buffer pre-fetched is based off of the GUC
effective_io_concurrency.

Also, arrange to provide similar details for auto-analyze that we have
for auto-vacuum, about the read rate, dirty rate, and add I/O timing
(when track_io_timing is enabled) to both.

Stephen Frost and Jakub Wartak

Discussion: https://www.postgresql.org/message-id/VI1PR0701MB69603A433348EDCF783C6ECBF6EF0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
---
 doc/src/sgml/config.sgml |   8 +-
 src/backend/access/heap/vacuumlazy.c |  18 +++
 src/backend/commands/analyze.c   | 157 +--
 3 files changed, 172 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index f043433e31..345a0305c6 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7358,9 +7358,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 I/O timing information is
 displayed in 
 pg_stat_database, in the output of
- when the BUFFERS option is
-used, and by .  Only superusers can
-change this setting.
+ when the BUFFERS option
+is used, by autovacuum for auto-vacuums and auto-analyzes, when
+ is set and by
+.  Only superusers can change 

Re: automatic analyze: readahead - add "IO read time" log message

2020-11-09 Thread Tomas Vondra
On 11/9/20 7:06 PM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
>> On 11/4/20 5:02 PM, Stephen Frost wrote:
>>> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> If you highlight "738754560" in the output it appears to duplicate the
> syscalls issued until it preads() - in case of "738754560" offset it was
> asked for 3 times. Also I wouldn't  imagine in wildest dreams that
> posix_fadvise(POSIX_FADV_WILLNEED) is such a cheap syscall.

 IMHO that'a a bug in the patch, which always tries to prefetch all
 "future" blocks, including those that were already prefetched. It
 probably needs to do something like bitmap heap scan where we track
 what was already prefetched and only issue the new blocks.
>>>
>>> Updated patch attached which:
>>>
>>> - Starts out by pre-fetching the first effective_io_concurrency number
>>>   of blocks we are going to want, hopefully making it so the kernel will
>>>   trust our fadvise's over its own read-ahead, right from the start.
>>> - Makes sure the prefetch iterator is pushed forward whenever the
>>>   regular interator is moved forward.
>>> - After each page read, issues a prefetch, similar to BitmapHeapScan, to
>>>   hopefully avoiding having the prefetching get in the way of the
>>>   regular i/o.
>>> - Added some comments, ran pgindent, added a commit message.
>>
>> Nice, that was quick ;-)
> 
> :)
> 
>>> I do think we should also include patch that Jakub wrote previously
>>> which adds information about the read rate of ANALYZE.
>>
>> +1
> 
> Attached is an updated patch which updates the documentation and
> integrates Jakub's initial work on improving the logging around
> auto-analyze (and I made the logging in auto-vacuum more-or-less match
> it).
> 

Thanks. I'll do some testing/benchmarking once my machines are free, in
a couple days perhaps. But as I said before, I don't expect this to
behave very differently from other places that already do prefetching.

>>> I'll look at integrating that into this patch and then look at a new
>>> patch to do something similar for VACUUM in a bit.
>>
>> +1
> 
> I spent some time looking into this but it's a bit complicated..  For
> some sound reasons, VACUUM will avoid skipping through a table when
> there's only a few pages that it could skip (not skipping allows us to
> move forward the relfrozenxid).  That said, perhaps we could start doing
> prefetching once we've decided that we're skipping.  We'd need to think
> about if we have to worry about the VM changing between the pre-fetching
> and the time when we're actually going to ask for the page..  I don't
> *think* that's an issue because only VACUUM would be changing the pages
> to be all-frozen or all-visible, and so if we see a page that isn't one
> of those then we're going to want to visit that page and that's not
> going to change, and we probably don't need to care about a page that
> used to be all-frozen and now isn't during this run- but if the prefetch
> went ahead and got page 10, and now page 8 is not-all-frozen and the
> actual scan is at page 5, then maybe it wants page 8 next and that isn't
> what we pre-fetched...
> 
> Anyhow, all-in-all, definitely more complicated and probably best
> considered and discussed independently.

+1

FWIW I wonder if this should be tracked separately in the CF app, as
it's very different from the original "add some logging" patch, which
makes the CF entry rather misleading.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: automatic analyze: readahead - add "IO read time" log message

2021-01-13 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> > Thanks. I'll do some testing/benchmarking once my machines are free, in
> > a couple days perhaps. But as I said before, I don't expect this to
> > behave very differently from other places that already do prefetching.
> 
> Agreed, but would still be nice to see test results beyond just what
> I've done.

Any chance you were able to run those tests..?

> > FWIW I wonder if this should be tracked separately in the CF app, as
> > it's very different from the original "add some logging" patch, which
> > makes the CF entry rather misleading.
> 
> I've gone ahead and updated the CF entry for this to hopefully make it
> clearer for those interested in looking at it.  I'll try to come back to
> this in the next CF, ideally we'd at least get someone else to take a
> look at the code beyond me. :)  (Obviously, you looked at it some, but
> wasn't really clear if you were alright with it or if you felt it needed
> more review.)

I've updated the patch to leverage Tom's introduction of
TimestampDifferenceMilliseconds, which simplifies things a bit (and I
don't think we need to worry about an analyze taking over 25 days...)
and generally rebased this up to current HEAD.

Would be great to get a review / comments from others as to if there's
any concerns.  I'll admit that it seems reasonably straight-forward to
me, but hey, I wrote most of it, so that's not really a fair
assessment... ;)

Thanks,

Stephen
From 8ed687ee875943a0bc41daaf23f8d3c95f70c9b5 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Wed, 4 Nov 2020 10:46:23 -0500
Subject: [PATCH] Use pre-fetching for ANALYZE and improve AV logging

When we have posix_fadvise() available, we can improve the performance
of an ANALYZE by quite a bit by using it to inform the kernel of the
blocks that we're going to be asking for.  Similar to bitmap index
scans, the number of buffers pre-fetched is based off of the GUC
effective_io_concurrency.

Also, arrange to provide similar details for auto-analyze that we have
for auto-vacuum, about the read rate, dirty rate, and add I/O timing
(when track_io_timing is enabled) to both.

Stephen Frost and Jakub Wartak

Discussion: https://www.postgresql.org/message-id/VI1PR0701MB69603A433348EDCF783C6ECBF6EF0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
---
 doc/src/sgml/config.sgml |   8 +-
 src/backend/access/heap/vacuumlazy.c |  18 +++
 src/backend/commands/analyze.c   | 160 +--
 3 files changed, 175 insertions(+), 11 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 82864bbb24..8b069bc176 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7409,9 +7409,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 I/O timing information is
 displayed in 
 pg_stat_database, in the output of
- when the BUFFERS option is
-used, and by .  Only superusers can
-change this setting.
+ when the BUFFERS option
+is used, by autovacuum for auto-vacuums and auto-analyzes, when
+ is set and by
+.  Only superusers can change this
+setting.

   
  
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..3a5e5a1ac2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -440,6 +440,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	TransactionId new_frozen_xid;
 	MultiXactId new_min_multi;
 	ErrorContextCallback errcallback;
+	PgStat_Counter startreadtime = 0;
+	PgStat_Counter startwritetime = 0;
 
 	Assert(params != NULL);
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
@@ -454,6 +456,11 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	{
 		pg_rusage_init(&ru0);
 		starttime = GetCurrentTimestamp();
+		if (track_io_timing)
+		{
+			startreadtime = pgStatBlockReadTime;
+			startwritetime = pgStatBlockWriteTime;
+		}
 	}
 
 	if (params->options & VACOPT_VERBOSE)
@@ -674,6 +681,17 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 			 (long long) VacuumPageDirty);
 			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 			 read_rate, write_rate);
+			if (track_io_timing)
+			{
+appendStringInfoString(&buf, _("I/O Timings:"));
+if (pgStatBlockReadTime - startreadtime > 0)
+	appendStringInfo(&buf, _(" read=%.3f"),
+	 (double) (pgStatBlockReadTime - startreadtime) / 1000);
+if (pgStatBlockWriteTime - startwritetime > 0)
+	appendStringInfo(&buf, _(" write=%.3f"),
+	 (double) (pgStatBlockWriteTime - startwritetime) / 1000);
+appendStringInfoChar(&buf, '\n');
+			}
 			appendStringInfo(&buf, _("system usage: %s\n"), pg_rusage_show(&ru0));
 			appendStringInfo(&buf,
 			 _("WAL usage: %ld records, %ld full page images, %llu byte

Re: automatic analyze: readahead - add "IO read time" log message

2021-01-26 Thread Heikki Linnakangas

On 13/01/2021 23:17, Stephen Frost wrote:

Would be great to get a review / comments from others as to if there's
any concerns.  I'll admit that it seems reasonably straight-forward to
me, but hey, I wrote most of it, so that's not really a fair
assessment... ;)


Look good overall. A few minor comments:

The patch consists of two parts: adding stats to the log for auto-analyze, 
and implementing prefetching. They seem like independent features; consider 
splitting them into two patches.


It's a bit weird that you get more stats in the log for 
autovacuum/autoanalyze than you get with VACUUM/ANALYZE VERBOSE. Not 
really this patch's fault though.


This conflicts with the patch at 
https://commitfest.postgresql.org/31/2907/, to refactor the table AM 
analyze API. That's OK, it's straightforward to resolve regardless of 
which patch is committed first.



	/* Outer loop over blocks to sample */
	while (BlockSampler_HasMore(&bs))
	{
#ifdef USE_PREFETCH
		BlockNumber prefetch_targblock = InvalidBlockNumber;
#endif
		BlockNumber targblock = BlockSampler_Next(&bs);

#ifdef USE_PREFETCH

		/*
		 * Make sure that every time the main BlockSampler is moved forward
		 * that our prefetch BlockSampler also gets moved forward, so that we
		 * always stay out ahead.
		 */
		if (BlockSampler_HasMore(&prefetch_bs))
			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
#endif

		vacuum_delay_point();

		if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
			continue;

#ifdef USE_PREFETCH

		/*
		 * When pre-fetching, after we get a block, tell the kernel about the
		 * next one we will want, if there's any left.
		 */
		if (effective_io_concurrency && prefetch_targblock != InvalidBlockNumber)
			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
#endif

		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
		{
			...
		}

		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
									 ++blksdone);
	}


If effective_io_concurrency == 0, this calls 
BlockSampler_Next(&prefetch_bs) anyway, which is a waste of cycles.


If table_scan_analyze_next_block() returns false, we skip the 
PrefetchBuffer() call. That seems wrong.


Is there any potential harm from calling PrefetchBuffer() on a page that 
table_scan_analyze_next_block() later deems as unsuitable for sampling 
and returns false? That's theoretical at the moment, because 
heapam_scan_analyze_next_block() always returns true. (The tableam 
ANALYZE API refactor patch will make this moot, as it moves this logic 
into the tableam's implementation, so the implementation can do whatever 
make sense for the particular AM.)


- Heikki




Re: automatic analyze: readahead - add "IO read time" log message

2021-02-05 Thread Stephen Frost
Greetings,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> On 13/01/2021 23:17, Stephen Frost wrote:
> >Would be great to get a review / comments from others as to if there's
> >any concerns.  I'll admit that it seems reasonably straight-forward to
> >me, but hey, I wrote most of it, so that's not really a fair
> >assessment... ;)
> 
> Look good overall. A few minor comments:

Thanks a lot for the review!

> The patch consists of two part: add stats to the log for auto-analyze, and
> implement prefetching. They seem like independent features, consider
> splitting into two patches.

Yeah, that's a good point.  I had anticipated that there would be
overlap but in the end there really wasn't.  Done in the attached.

> It's a bit weird that you get more stats in the log for
> autovacuum/autoanalyze than you get with VACUUM/ANALYZE VERBOSE. Not really
> this patch's fault though.

Agreed.

> This conflicts with the patch at https://commitfest.postgresql.org/31/2907/,
> to refactor the table AM analyze API. That's OK, it's straightforward to
> resolve regardless of which patch is committed first.

Agreed.

> > 	/* Outer loop over blocks to sample */
> > 	while (BlockSampler_HasMore(&bs))
> > 	{
> > #ifdef USE_PREFETCH
> > 		BlockNumber prefetch_targblock = InvalidBlockNumber;
> > #endif
> > 		BlockNumber targblock = BlockSampler_Next(&bs);
> >
> > #ifdef USE_PREFETCH
> >
> > 		/*
> > 		 * Make sure that every time the main BlockSampler is moved forward
> > 		 * that our prefetch BlockSampler also gets moved forward, so that we
> > 		 * always stay out ahead.
> > 		 */
> > 		if (BlockSampler_HasMore(&prefetch_bs))
> > 			prefetch_targblock = BlockSampler_Next(&prefetch_bs);
> > #endif
> >
> > 		vacuum_delay_point();
> >
> > 		if (!table_scan_analyze_next_block(scan, targblock, vac_strategy))
> > 			continue;
> >
> > #ifdef USE_PREFETCH
> >
> > 		/*
> > 		 * When pre-fetching, after we get a block, tell the kernel about the
> > 		 * next one we will want, if there's any left.
> > 		 */
> > 		if (effective_io_concurrency && prefetch_targblock != InvalidBlockNumber)
> > 			PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, prefetch_targblock);
> > #endif
> >
> > 		while (table_scan_analyze_next_tuple(scan, OldestXmin, &liverows, &deadrows, slot))
> > 		{
> > 		...
> > 		}
> >
> > 		pgstat_progress_update_param(PROGRESS_ANALYZE_BLOCKS_DONE,
> > 									 ++blksdone);
> > 	}
> 
> If effective_io_concurrency == 0, this calls BlockSampler_Next(&prefetch_bs)
> anyway, which is a waste of cycles.

Good point, fixed.

> If table_scan_analyze_next_block() returns false, we skip the
> PrefetchBuffer() call. That seem wrong.

Agreed, fixed.

> Is there any potential harm from calling PrefetchBuffer() on a page that
> table_scan_analyze_next_block() later deems as unsuitable for sampling and
> returns false? That's theoretical at the moment, because
> heapam_scan_analyze_next_block() always returns true. (The tableam ANALYZE
> API refactor patch will make this moot, as it moves this logic into the
> tableam's implementation, so the implementation can do whatever makes sense
> for the particular AM.)

I can't see any potential harm, and it seems pretty likely that if a
heapam_scan_analyze_next_block()-equivalent were to decide that a block
isn't appropriate to analyze, it'd have to do so after reading that
block anyway, making the prefetch still useful.

Perhaps there'll be a case in the future where a given AM would know,
based on just the block number, that it isn't useful to analyze, in
which case it'd make sense to adjust the code to skip that block for
both prefetching and actually reading, but I don't think that would be
too hard to do.  It doesn't seem sensible to invent that in advance of
actually having that case though; it's certainly not the case for the
heap AM today, at least, as you say.

Unless there's anything else on this, I'll commit these sometime next
week.

Thanks again for the review!

Stephen
From a65e4a9a7d666d7cfcbbc03c2cf50a2552888cab Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Fri, 5 Feb 2021 15:59:02 -0500
Subject: [PATCH 1/2] Improve logging of auto-vacuum and auto-analyze

When logging auto-vacuum and auto-analyze activity, include the I/O
timing if track_io_timing is enabled.  Also, for auto-analyze, add the
read rate and the dirty rate, similar to how that information has
historically been logged for auto-vacuum.

Stephen Frost and Jakub Wartak

Reviewed-By: Heikki Linnakangas
Discussion: https://www.postgresql.org/message-id/VI1PR0701MB69603A433348EDCF783C6ECBF6EF0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
---
 doc/src/sgml/config.sgml

Re: automatic analyze: readahead - add "IO read time" log message

2021-02-08 Thread Heikki Linnakangas

On 05/02/2021 23:22, Stephen Frost wrote:

Unless there's anything else on this, I'll commit these sometime next
week.


One more thing: Instead of using the 'effective_io_concurrency' GUC 
directly, this should call get_tablespace_maintenance_io_concurrency().


- Heikki




Re: automatic analyze: readahead - add "IO read time" log message

2021-02-10 Thread Stephen Frost
Greetings,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> On 05/02/2021 23:22, Stephen Frost wrote:
> >Unless there's anything else on this, I'll commit these sometime next
> >week.
> 
> One more thing: Instead of using 'effective_io_concurrency' GUC directly,
> should call get_tablespace_maintenance_io_concurrency().

Ah, yeah, of course.

Updated patch attached.

Thanks!

Stephen
From 5aa55b44230474e6e61873260e9595a5495a1094 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Fri, 5 Feb 2021 15:59:02 -0500
Subject: [PATCH 1/2] Improve logging of auto-vacuum and auto-analyze

When logging auto-vacuum and auto-analyze activity, include the I/O
timing if track_io_timing is enabled.  Also, for auto-analyze, add the
read rate and the dirty rate, similar to how that information has
historically been logged for auto-vacuum.

Stephen Frost and Jakub Wartak

Reviewed-By: Heikki Linnakangas
Discussion: https://www.postgresql.org/message-id/VI1PR0701MB69603A433348EDCF783C6ECBF6EF0%40VI1PR0701MB6960.eurprd07.prod.outlook.com
---
 doc/src/sgml/config.sgml |   8 ++-
 src/backend/access/heap/vacuumlazy.c |  18 +
 src/backend/commands/analyze.c   | 100 +--
 3 files changed, 116 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5ef1c7ad3c..2da2f2e32a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7411,9 +7411,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
 I/O timing information is
 displayed in 
 pg_stat_database, in the output of
- when the BUFFERS option is
-used, and by .  Only superusers can
-change this setting.
+ when the BUFFERS option
+is used, by autovacuum for auto-vacuums and auto-analyzes, when
+ is set and by
+.  Only superusers can change this
+setting.

   
  
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f3d2265fad..3a5e5a1ac2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -440,6 +440,8 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	TransactionId new_frozen_xid;
 	MultiXactId new_min_multi;
 	ErrorContextCallback errcallback;
+	PgStat_Counter startreadtime = 0;
+	PgStat_Counter startwritetime = 0;
 
 	Assert(params != NULL);
 	Assert(params->index_cleanup != VACOPT_TERNARY_DEFAULT);
@@ -454,6 +456,11 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 	{
 		pg_rusage_init(&ru0);
 		starttime = GetCurrentTimestamp();
+		if (track_io_timing)
+		{
+			startreadtime = pgStatBlockReadTime;
+			startwritetime = pgStatBlockWriteTime;
+		}
 	}
 
 	if (params->options & VACOPT_VERBOSE)
@@ -674,6 +681,17 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params,
 			 (long long) VacuumPageDirty);
 			appendStringInfo(&buf, _("avg read rate: %.3f MB/s, avg write rate: %.3f MB/s\n"),
 			 read_rate, write_rate);
+			if (track_io_timing)
+			{
+				appendStringInfoString(&buf, _("I/O Timings:"));
+				if (pgStatBlockReadTime - startreadtime > 0)
+					appendStringInfo(&buf, _(" read=%.3f"),
+									 (double) (pgStatBlockReadTime - startreadtime) / 1000);
+				if (pgStatBlockWriteTime - startwritetime > 0)
+					appendStringInfo(&buf, _(" write=%.3f"),
+									 (double) (pgStatBlockWriteTime - startwritetime) / 1000);
+				appendStringInfoChar(&buf, '\n');
+			}
 			appendStringInfo(&buf, _("system usage: %s\n"), pg_rusage_show(&ru0));
 			appendStringInfo(&buf,
 			 _("WAL usage: %ld records, %ld full page images, %llu bytes"),
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 7295cf0215..0864cda80e 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -312,6 +312,11 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
+	int64		AnalyzePageHit = VacuumPageHit;
+	int64		AnalyzePageMiss = VacuumPageMiss;
+	int64		AnalyzePageDirty = VacuumPageDirty;
+	PgStat_Counter startreadtime = 0;
+	PgStat_Counter startwritetime = 0;
 
 	if (inh)
 		ereport(elevel,
@@ -346,8 +351,14 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	/* measure elapsed time iff autovacuum logging requires it */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
+		if (track_io_timing)
+		{
+			startreadtime = pgStatBlockReadTime;
+			startwritetime = pgStatBlockWriteTime;
+		}
+
 		pg_rusage_init(&ru0);
-		if (params->log_min_duration > 0)
+		if (params->log_min_duration >= 0)
 			starttime = GetCurrentTimestamp();
 	}
 
@@ -682,15 +693,90 @@ do_analyze_rel(Relation onerel, VacuumParams *params,
 	/* Log the action if appropriate */
 	if (IsAutoVacuumWorkerProcess() && params->log_min_duration >= 0)
 	{
+		TimestampTz endtime = GetCurrentTimestamp();
+
 		if (params->log_min_duration == 0 ||
-			TimestampDifferen

Re: automatic analyze: readahead - add "IO read time" log message

2021-05-25 Thread Egor Rogov

Hi,

On 11.02.2021 01:10, Stephen Frost wrote:


Greetings,

* Heikki Linnakangas (hlinn...@iki.fi) wrote:

On 05/02/2021 23:22, Stephen Frost wrote:

Unless there's anything else on this, I'll commit these sometime next
week.

One more thing: Instead of using 'effective_io_concurrency' GUC directly,
should call get_tablespace_maintenance_io_concurrency().

Ah, yeah, of course.

Updated patch attached.



I happened to notice that get_tablespace_io_concurrency() is called 
instead of get_tablespace_maintenance_io_concurrency(). It doesn't look 
right, no?



Regards,
Egor Rogov.





Re: automatic analyze: readahead - add "IO read time" log message

2021-05-25 Thread Stephen Frost
Greetings,

* Egor Rogov (e.ro...@postgrespro.ru) wrote:
> On 11.02.2021 01:10, Stephen Frost wrote:
> >* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> >>On 05/02/2021 23:22, Stephen Frost wrote:
> >>>Unless there's anything else on this, I'll commit these sometime next
> >>>week.
> >>One more thing: Instead of using 'effective_io_concurrency' GUC directly,
> >>should call get_tablespace_maintenance_io_concurrency().
> >Ah, yeah, of course.
> >
> >Updated patch attached.
> 
> I happened to notice that get_tablespace_io_concurrency() is called instead
> of get_tablespace_maintenance_io_concurrency(). It doesn't look right, no?

Hah, indeed you're right.  Will fix.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: automatic analyze: readahead - add "IO read time" log message

2021-05-25 Thread Justin Pryzby
This patch adds hits/misses/dirtied, but explain says hit/read/dirtied/written.

Should it say "read" instead of "misses" ?

src/backend/access/heap/vacuumlazy.c:   _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),

src/backend/commands/explain.c-			appendStringInfo(es->str, " hit=%lld",
src/backend/commands/explain.c-							 (long long) usage->shared_blks_hit);
src/backend/commands/explain.c-			if (usage->shared_blks_read > 0)
src/backend/commands/explain.c-				appendStringInfo(es->str, " read=%lld",
src/backend/commands/explain.c-								 (long long) usage->shared_blks_read);
src/backend/commands/explain.c-			if (usage->shared_blks_dirtied > 0)
src/backend/commands/explain.c:				appendStringInfo(es->str, " dirtied=%lld",
src/backend/commands/explain.c-								 (long long) usage->shared_blks_dirtied);
src/backend/commands/explain.c-			if (usage->shared_blks_written > 0)
src/backend/commands/explain.c-				appendStringInfo(es->str, " written=%lld",
src/backend/commands/explain.c-								 (long long) usage->shared_blks_written);
src/backend/commands/explain.c-			if (has_local || has_temp)




Re: automatic analyze: readahead - add "IO read time" log message

2021-05-25 Thread Stephen Frost
Greetings,

* Justin Pryzby (pry...@telsasoft.com) wrote:
> This patch adds hits/misses/dirtied, but explain says 
> hit/read/dirtied/written.
> 
> Should it say "read" instead of "misses" ?
> 
> src/backend/access/heap/vacuumlazy.c:   _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),

Errr, the analyze readahead patch set didn't add the above use of
'misses', that's been around for an age or more.  For giggles, I hunted
it back to be this 2011 commit: 9d3b50244357ef4c4e3b6e01f91de599077179c8

As for the "misses" usage which was actually added by the patch here:

src/backend/commands/analyze.c: appendStringInfo(&buf, _("buffer usage: %lld hits, %lld misses, %lld dirtied\n"),

That was specifically done to match the 'vacuumlazy.c' usage of 'misses'
above.

As for which to use, I don't have any particular dog in that fight, but
I do think we should be consistent between how VACUUM reports this
information and how ANALYZE does.

Thanks!

Stephen




Re: automatic analyze: readahead - add "IO read time" log message

2020-12-05 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@enterprisedb.com) wrote:
> Thanks. I'll do some testing/benchmarking once my machines are free, in
> a couple days perhaps. But as I said before, I don't expect this to
> behave very differently from other places that already do prefetching.

Agreed, but would still be nice to see test results beyond just what
I've done.

> FWIW I wonder if this should be tracked separately in the CF app, as
> it's very different from the original "add some logging" patch, which
> makes the CF entry rather misleading.

I've gone ahead and updated the CF entry for this to hopefully make it
clearer for those interested in looking at it.  I'll try to come back to
this in the next CF, ideally we'd at least get someone else to take a
look at the code beyond me. :)  (Obviously, you looked at it some, but
it wasn't really clear whether you were alright with it or felt it
needed more review.)

Thanks!

Stephen




Re: automatic analyze: readahead - add "IO read time" log message

2021-08-27 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Egor Rogov (e.ro...@postgrespro.ru) wrote:
> > On 11.02.2021 01:10, Stephen Frost wrote:
> > >* Heikki Linnakangas (hlinn...@iki.fi) wrote:
> > >>On 05/02/2021 23:22, Stephen Frost wrote:
> > >>>Unless there's anything else on this, I'll commit these sometime next
> > >>>week.
> > >>One more thing: Instead of using 'effective_io_concurrency' GUC directly,
> > >>should call get_tablespace_maintenance_io_concurrency().
> > >Ah, yeah, of course.
> > >
> > >Updated patch attached.
> > 
> > I happened to notice that get_tablespace_io_concurrency() is called instead
> > of get_tablespace_maintenance_io_concurrency(). It doesn't look right, no?
> 
> Hah, indeed you're right.  Will fix.

Found this under a bit of a pile in my todo list. :)

Fix pushed.

Thanks!

Stephen

