RE: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-09-01 Thread Sait Talha Nisanci
Hi,

The WAL size for "SSD, full_page_writes=on" was 36GB. I currently don't have 
the exact size for the other rows because my test VMs got auto-deleted. I can 
possibly redo the benchmark to get pg_waldump stats for each row.

Best,
Talha.


-Original Message-
From: Stephen Frost  
Sent: Sunday, August 30, 2020 3:24 PM
To: Tomas Vondra 
Cc: Robert Haas ; Andres Freund ; 
Sait Talha Nisanci ; Thomas Munro 
; Dmitry Dolgov <9erthali...@gmail.com>; David Steele 
; Alvaro Herrera ; pgsql-hackers 

Subject: Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote:
> >* Robert Haas (robertmh...@gmail.com) wrote:
> >>On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost  wrote:
> >>> > Hm? At least earlier versions didn't do prefetching for records with an 
> >>> > fpw, and only for subsequent records affecting the same page, or if not in 
> >>> > s_b anymore.
> >>>
> >>> We don't actually read the page when we're replaying an FPW though..?
> >>> If we don't read it, and we entirely write the page from the FPW, 
> >>> how is pre-fetching helping..?
> >>
> >>Suppose there is a checkpoint. Then we replay a record with an FPW, 
> >>pre-fetching nothing. Then the buffer gets evicted from 
> >>shared_buffers, and maybe the OS cache too. Then, before the next 
> >>checkpoint, we again replay a record for the same page. At this 
> >>point, pre-fetching should be helpful.
> >
> >Sure- but if we're talking about 25GB of WAL, on a server that's got 
> >32GB, then why would those pages end up getting evicted from memory 
> >entirely?  Particularly, enough of them to end up with such a huge 
> >difference in replay time..
> >
> >I do agree that if we've got more outstanding WAL between checkpoints 
> >than the system's got memory then that certainly changes things, but 
> >that wasn't what I understood the case to be here.
> 
> I don't think it's very clear how much WAL there actually was in each 
> case - the message only said there was more than 25GB, but who knows 
> how many checkpoints that covers? In the cases with FPW=on this may 
> easily be much less than one checkpoint (because with scale 45GB an 
> update to every page will log 45GB of full-page images). It'd be 
> interesting to see some stats from pg_waldump etc.

Also in the message was this:

--
In order to avoid checkpoints during benchmark, max_wal_size(200GB) and
checkpoint_timeout(200 mins) are set to a high value.
--

Which led me to suspect, at least, that this was much less than a checkpoint, 
as you suggest.  Also, given that the comment was 'run is cancelled when there 
is a reasonable amount of WAL (>25GB)', it seems likely that it's at least 
*around* there.

Ultimately though, there just isn't enough information provided to really be 
able to understand what's going on.  I agree, pg_waldump stats would be useful.
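
(A minimal sketch of the pg_waldump summary being suggested here,
assuming a default pg_wal layout; the segment names are placeholders,
not from the benchmark:)

    # --stats=record breaks the WAL volume down per record type,
    # including full-page image (FPI) bytes, over the segment range.
    pg_waldump --stats=record \
        "$PGDATA/pg_wal/000000010000000000000001" \
        "$PGDATA/pg_wal/0000000100000000000000FF"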

> >>Admittedly, I don't quite understand whether that is what is 
> >>happening in this test case, or why SSD vs. HDD should make any 
> >>difference. But there doesn't seem to be any reason why it doesn't 
> >>make sense in theory.
> >
> >I agree that this could be a reason, but it doesn't seem to quite fit 
> >in this particular case given the amount of memory and WAL.  I'm 
> >suspecting that it's something else and I'd very much like to know if 
> >it's a general "this applies to all (most?  a lot of?) SSDs because 
> >the hardware has a larger than 8KB page size and therefore the kernel 
> >has to read it", or if it's something odd about this particular 
> >system and doesn't apply generally.
> 
> Not sure. I doubt it has anything to do with the hardware page size, 
> that's mostly transparent to the kernel anyway. But it might be that 
> the prefetching on a particular SSD has more overhead than what it saves.

Right- I wouldn't have thought the hardware page size would matter either, but 
it's entirely possible that assumption is wrong and that it does matter for 
some reason- perhaps with just some SSDs, or maybe with a lot of them, or maybe 
there's something else entirely going on.  About all I feel like I can say at 
the moment is that I'm very interested in ways to make WAL replay go faster and 
it'd be great to get more information about what's going on here to see if 
there's something we can do to generally improve WAL replay.

Thanks,

Stephen




Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-08-30 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote:
> >* Robert Haas (robertmh...@gmail.com) wrote:
> >>On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost  wrote:
> >>> > Hm? At least earlier versions didn't do prefetching for records with an 
> >>> > fpw, and only for subsequent records affecting the same page, or if not in 
> >>> > s_b anymore.
> >>>
> >>> We don't actually read the page when we're replaying an FPW though..?
> >>> If we don't read it, and we entirely write the page from the FPW, how is
> >>> pre-fetching helping..?
> >>
> >>Suppose there is a checkpoint. Then we replay a record with an FPW,
> >>pre-fetching nothing. Then the buffer gets evicted from
> >>shared_buffers, and maybe the OS cache too. Then, before the next
> >>checkpoint, we again replay a record for the same page. At this point,
> >>pre-fetching should be helpful.
> >
> >Sure- but if we're talking about 25GB of WAL, on a server that's got
> >32GB, then why would those pages end up getting evicted from memory
> >entirely?  Particularly, enough of them to end up with such a huge
> >difference in replay time..
> >
> >I do agree that if we've got more outstanding WAL between checkpoints
> >than the system's got memory then that certainly changes things, but
> >that wasn't what I understood the case to be here.
> 
> I don't think it's very clear how much WAL there actually was in each
> case - the message only said there was more than 25GB, but who knows how
> many checkpoints that covers? In the cases with FPW=on this may easily
> be much less than one checkpoint (because with scale 45GB an update to
> every page will log 45GB of full-page images). It'd be interesting to
> see some stats from pg_waldump etc.

Also in the message was this:

--
In order to avoid checkpoints during benchmark, max_wal_size(200GB) and
checkpoint_timeout(200 mins) are set to a high value.
--

Which led me to suspect, at least, that this was much less than a
checkpoint, as you suggest.  Also, given that the comment was 'run is
cancelled when there is a reasonable amount of WAL (>25GB)', it seems
likely that it's at least *around* there.

Ultimately though, there just isn't enough information provided to
really be able to understand what's going on.  I agree, pg_waldump stats
would be useful.

> >>Admittedly, I don't quite understand whether that is what is happening
> >>in this test case, or why SSD vs. HDD should make any difference. But
> >>there doesn't seem to be any reason why it doesn't make sense in
> >>theory.
> >
> >I agree that this could be a reason, but it doesn't seem to quite fit in
> >this particular case given the amount of memory and WAL.  I'm suspecting
> >that it's something else and I'd very much like to know if it's a
> >general "this applies to all (most?  a lot of?) SSDs because the
> >hardware has a larger than 8KB page size and therefore the kernel has to
> >read it", or if it's something odd about this particular system and
> >doesn't apply generally.
> 
> Not sure. I doubt it has anything to do with the hardware page size,
> that's mostly transparent to the kernel anyway. But it might be that the
> prefetching on a particular SSD has more overhead than what it saves.

Right- I wouldn't have thought the hardware page size would matter
either, but it's entirely possible that assumption is wrong and that it
does matter for some reason- perhaps with just some SSDs, or maybe with
a lot of them, or maybe there's something else entirely going on.  About
all I feel like I can say at the moment is that I'm very interested in
ways to make WAL replay go faster and it'd be great to get more
information about what's going on here to see if there's something we
can do to generally improve WAL replay.

Thanks,

Stephen




Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-08-29 Thread Tomas Vondra

On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote:
>Greetings,
>
>* Robert Haas (robertmh...@gmail.com) wrote:
>>On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost  wrote:
>>> > Hm? At least earlier versions didn't do prefetching for records with an
>>> > fpw, and only for subsequent records affecting the same page, or if not
>>> > in s_b anymore.
>>>
>>> We don't actually read the page when we're replaying an FPW though..?
>>> If we don't read it, and we entirely write the page from the FPW, how is
>>> pre-fetching helping..?
>>
>>Suppose there is a checkpoint. Then we replay a record with an FPW,
>>pre-fetching nothing. Then the buffer gets evicted from
>>shared_buffers, and maybe the OS cache too. Then, before the next
>>checkpoint, we again replay a record for the same page. At this point,
>>pre-fetching should be helpful.
>
>Sure- but if we're talking about 25GB of WAL, on a server that's got
>32GB, then why would those pages end up getting evicted from memory
>entirely?  Particularly, enough of them to end up with such a huge
>difference in replay time..
>
>I do agree that if we've got more outstanding WAL between checkpoints
>than the system's got memory then that certainly changes things, but
>that wasn't what I understood the case to be here.

I don't think it's very clear how much WAL there actually was in each
case - the message only said there was more than 25GB, but who knows how
many checkpoints that covers? In the cases with FPW=on this may easily
be much less than one checkpoint (because with scale 45GB an update to
every page will log 45GB of full-page images). It'd be interesting to
see some stats from pg_waldump etc.

>>Admittedly, I don't quite understand whether that is what is happening
>>in this test case, or why SSD vs. HDD should make any difference. But
>>there doesn't seem to be any reason why it doesn't make sense in
>>theory.
>
>I agree that this could be a reason, but it doesn't seem to quite fit in
>this particular case given the amount of memory and WAL.  I'm suspecting
>that it's something else and I'd very much like to know if it's a
>general "this applies to all (most?  a lot of?) SSDs because the
>hardware has a larger than 8KB page size and therefore the kernel has to
>read it", or if it's something odd about this particular system and
>doesn't apply generally.

Not sure. I doubt it has anything to do with the hardware page size,
that's mostly transparent to the kernel anyway. But it might be that the
prefetching on a particular SSD has more overhead than what it saves.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




RE: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-08-27 Thread Sait Talha Nisanci
Hi Stephen,

OS version is Ubuntu 18.04.5 LTS.
Filesystem is ext4 and block size is 4KB.

Talha.

-Original Message-
From: Stephen Frost  
Sent: Thursday, August 27, 2020 4:56 PM
To: Sait Talha Nisanci 
Cc: Thomas Munro ; Tomas Vondra 
; Dmitry Dolgov <9erthali...@gmail.com>; David 
Steele ; Andres Freund ; Alvaro 
Herrera ; pgsql-hackers 
Subject: Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Greetings,

* Sait Talha Nisanci (sait.nisa...@microsoft.com) wrote:
> I have run some benchmarks for this patch. Overall it seems that there is a 
> good improvement with the patch on recovery times:

Maybe I missed it somewhere, but what's the OS/filesystem being used here..?  
What's the filesystem block size..?

Thanks,

Stephen




Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-08-27 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost  wrote:
> > > Hm? At least earlier versions didn't do prefetching for records with an 
> > > fpw, and only for subsequent records affecting the same page, or if not in s_b 
> > > anymore.
> >
> > We don't actually read the page when we're replaying an FPW though..?
> > If we don't read it, and we entirely write the page from the FPW, how is
> > pre-fetching helping..?
> 
> Suppose there is a checkpoint. Then we replay a record with an FPW,
> pre-fetching nothing. Then the buffer gets evicted from
> shared_buffers, and maybe the OS cache too. Then, before the next
> checkpoint, we again replay a record for the same page. At this point,
> pre-fetching should be helpful.

Sure- but if we're talking about 25GB of WAL, on a server that's got
32GB, then why would those pages end up getting evicted from memory
entirely?  Particularly, enough of them to end up with such a huge
difference in replay time..

I do agree that if we've got more outstanding WAL between checkpoints
than the system's got memory then that certainly changes things, but
that wasn't what I understood the case to be here.

> Admittedly, I don't quite understand whether that is what is happening
> in this test case, or why SSD vs. HDD should make any difference. But
> there doesn't seem to be any reason why it doesn't make sense in
> theory.

I agree that this could be a reason, but it doesn't seem to quite fit in
this particular case given the amount of memory and WAL.  I'm suspecting
that it's something else and I'd very much like to know if it's a
general "this applies to all (most?  a lot of?) SSDs because the
hardware has a larger than 8KB page size and therefore the kernel has to
read it", or if it's something odd about this particular system and
doesn't apply generally.

Thanks,

Stephen




Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-08-27 Thread Robert Haas
On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost  wrote:
> > Hm? At least earlier versions didn't do prefetching for records with an 
> > fpw, and only for subsequent records affecting the same page, or if not in s_b 
> > anymore.
>
> We don't actually read the page when we're replaying an FPW though..?
> If we don't read it, and we entirely write the page from the FPW, how is
> pre-fetching helping..?

Suppose there is a checkpoint. Then we replay a record with an FPW,
pre-fetching nothing. Then the buffer gets evicted from
shared_buffers, and maybe the OS cache too. Then, before the next
checkpoint, we again replay a record for the same page. At this point,
pre-fetching should be helpful.

Admittedly, I don't quite understand whether that is what is happening
in this test case, or why SSD vs. HDD should make any difference. But
there doesn't seem to be any reason why it doesn't make sense in
theory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-08-27 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On August 27, 2020 11:26:42 AM PDT, Stephen Frost  wrote:
> >Is WAL FPW compression enabled..?  I'm trying to figure out how, given
> >what's been shared here, replaying 25GB of WAL is being helped out
> >by 2.5x thanks to prefetch in the SSD case.  That prefetch is hurting in
> >the HDD case entirely makes sense to me- we're spending time reading
> >pages from the HDD, which is entirely pointless work given that we're
> >just going to write over those pages entirely with FPWs.
> 
> Hm? At least earlier versions didn't do prefetching for records with an fpw, 
> and only for subsequent records affecting the same page, or if not in s_b anymore.

We don't actually read the page when we're replaying an FPW though..?
If we don't read it, and we entirely write the page from the FPW, how is
pre-fetching helping..?  I understood how it could be helpful for
filesystems which have a larger block size than ours (eg: zfs w/ 16kb
block sizes where the kernel needs to get the whole 16kb block when we
only write 8kb to it), but that's apparently not the case here.

So- what is it that pre-fetching is doing to result in such an
improvement?  Is there something lower level where the SSD physical
block size is coming into play, which is typically larger..?  I wouldn't
have thought so, but perhaps that's the case..
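
(The sizes in question can be checked directly on Linux- a sketch with
a placeholder mount point and device name, not details from this system:)

    # Filesystem block size at the data directory's mount point
    stat -fc 'fs block size: %s bytes' /var/lib/postgresql/data
    # Logical vs. physical block size reported by the SSD
    cat /sys/block/sda/queue/logical_block_size
    cat /sys/block/sda/queue/physical_block_size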

Thanks,

Stephen




Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-08-27 Thread Andres Freund
Hi, 

On August 27, 2020 11:26:42 AM PDT, Stephen Frost  wrote:
>Is WAL FPW compression enabled..?  I'm trying to figure out how, given
>what's been shared here, replaying 25GB of WAL is being helped out
>by 2.5x thanks to prefetch in the SSD case.  That prefetch is hurting in
>the HDD case entirely makes sense to me- we're spending time reading
>pages from the HDD, which is entirely pointless work given that we're
>just going to write over those pages entirely with FPWs.

Hm? At least earlier versions didn't do prefetching for records with an fpw, 
and only for subsequent records affecting the same page, or if not in s_b anymore.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-08-27 Thread Stephen Frost
Greetings,

* Sait Talha Nisanci (sait.nisa...@microsoft.com) wrote:
> OS version is Ubuntu 18.04.5 LTS.
> Filesystem is ext4 and block size is 4KB.

[...]

* Sait Talha Nisanci (sait.nisa...@microsoft.com) wrote:
> I have run some benchmarks for this patch. Overall it seems that there is a 
> good improvement with the patch on recovery times:
> 
> The VMs I used have 32GB RAM, pgbench is initialized with a scale factor of 
> 3000 (so it doesn't fit in memory, ~45GB).
> 
> In order to avoid checkpoints during benchmark, max_wal_size(200GB) and 
> checkpoint_timeout(200 mins) are set to a high value. 
> 
> The run is cancelled when there is a reasonable amount of WAL ( > 25GB). The 
> recovery times are measured from the REDO logs.
> 
> I have tried combinations of SSD/HDD, full_page_writes = on/off and 
> max_io_concurrency = 10/50, the recovery times are as follows (in seconds):
> 
>                               No prefetch | Default prefetch values | Default + max_io_concurrency = 50
> SSD, full_page_writes = on            852                       301                                 197
> SSD, full_page_writes = off          1642                      1359                                1391
> HDD, full_page_writes = on           6027                      6345                                6390
> HDD, full_page_writes = off           738                       275                                 192
> 
> Default prefetch values:
> - Max_recovery_prefetch_distance = 256KB
> - Max_io_concurrency = 10
> 
> It probably makes sense to compare each row separately as the size of WAL can 
> be different.

Is WAL FPW compression enabled..?  I'm trying to figure out how, given
what's been shared here, replaying 25GB of WAL is being helped out
by 2.5x thanks to prefetch in the SSD case.  That prefetch is hurting in
the HDD case entirely makes sense to me- we're spending time reading
pages from the HDD, which is entirely pointless work given that we're
just going to write over those pages entirely with FPWs.
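
(For reference, the compression setting being asked about can be checked
with a one-liner- a sketch, assuming default connection settings:)

    # wal_compression defaults to off; 'on' would shrink FPW-heavy WAL
    # and change how much activity 25GB of WAL actually represents.
    psql -Atc 'SHOW wal_compression'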

Further, if there's 32GB of RAM, and WAL compression isn't enabled and
the WAL is only 25GB, then it's very likely that every page touched by
the WAL ends up in memory (shared buffers or fs cache), and with FPWs we
shouldn't ever need to actually read from the storage to get those
pages, right?  So how is prefetch helping so much..?

I'm not sure that the 'full_page_writes = off' tests are very
interesting in this case, since you're going to get torn pages and
therefore corruption and hopefully no one is running with that
configuration with this OS/filesystem.

Thanks,

Stephen




Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-08-27 Thread Stephen Frost
Greetings,

* Sait Talha Nisanci (sait.nisa...@microsoft.com) wrote:
> I have run some benchmarks for this patch. Overall it seems that there is a 
> good improvement with the patch on recovery times:

Maybe I missed it somewhere, but what's the OS/filesystem being used
here..?  What's the filesystem block size..?

Thanks,

Stephen




Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-08-27 Thread Robert Haas
On Wed, Aug 26, 2020 at 9:42 AM Sait Talha Nisanci
 wrote:
> I have tried combinations of SSD/HDD, full_page_writes = on/off and 
> max_io_concurrency = 10/50, the recovery times are as follows (in seconds):
>
>                               No prefetch | Default prefetch values | Default + max_io_concurrency = 50
> SSD, full_page_writes = on            852                       301                                 197
> SSD, full_page_writes = off          1642                      1359                                1391
> HDD, full_page_writes = on           6027                      6345                                6390
> HDD, full_page_writes = off           738                       275                                 192

The regression on HDD with full_page_writes=on is interesting. I don't
know why that should happen, and I wonder if there is anything that
can be done to mitigate it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




RE: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

2020-08-26 Thread Sait Talha Nisanci
I have run some benchmarks for this patch. Overall it seems that there is a 
good improvement with the patch on recovery times:

The VMs I used have 32GB RAM, pgbench is initialized with a scale factor of 
3000 (so it doesn't fit in memory, ~45GB).

In order to avoid checkpoints during benchmark, max_wal_size(200GB) and 
checkpoint_timeout(200 mins) are set to a high value. 

The run is cancelled when there is a reasonable amount of WAL ( > 25GB). The 
recovery times are measured from the REDO logs.
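
(Presumably this means diffing the timestamps of the startup process's
"redo starts"/"redo done" messages- a sketch with a placeholder log path,
assuming log timestamps are enabled:)

    # The elapsed time between these two log lines is the recovery time.
    grep -E 'redo (starts|done)' "$PGDATA/log/postgresql.log"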

I have tried combinations of SSD/HDD, full_page_writes = on/off and 
max_io_concurrency = 10/50, the recovery times are as follows (in seconds):

                              No prefetch | Default prefetch values | Default + max_io_concurrency = 50
SSD, full_page_writes = on            852                       301                                 197
SSD, full_page_writes = off          1642                      1359                                1391
HDD, full_page_writes = on           6027                      6345                                6390
HDD, full_page_writes = off           738                       275                                 192

Default prefetch values:
-   Max_recovery_prefetch_distance = 256KB
-   Max_io_concurrency = 10
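
(Putting the settings above together- a sketch of applying them via
ALTER SYSTEM; the two prefetch GUCs are spelled as in this version of
the patch and do not exist in unpatched PostgreSQL:)

    psql -c "ALTER SYSTEM SET max_wal_size = '200GB'"
    psql -c "ALTER SYSTEM SET checkpoint_timeout = '200min'"
    psql -c "ALTER SYSTEM SET max_recovery_prefetch_distance = '256kB'"
    psql -c "ALTER SYSTEM SET max_io_concurrency = 10"  # 50 for the third column
    psql -c "SELECT pg_reload_conf()"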

It probably makes sense to compare each row separately as the size of WAL can 
be different.

Talha.

-Original Message-
From: Thomas Munro  
Sent: Thursday, August 13, 2020 9:57 AM
To: Tomas Vondra 
Cc: Stephen Frost ; Dmitry Dolgov <9erthali...@gmail.com>; 
David Steele ; Andres Freund ; Alvaro 
Herrera ; pgsql-hackers 
Subject: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

On Thu, Aug 6, 2020 at 10:47 PM Tomas Vondra  
wrote:
> On Thu, Aug 06, 2020 at 02:58:44PM +1200, Thomas Munro wrote:
> >On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra
> >> Any luck trying to reproduce things? Should I try again and collect 
> >> some additional debug info?
> >
> >No luck.  I'm working on it now, and also trying to reduce the 
> >overheads so that we're not doing extra work when it doesn't help.
>
> OK, I'll see if I can still reproduce it.

Since someone else asked me off-list, here's a rebase, with no functional 
changes.  Soon I'll post a new improved version, but this version just fixes 
the bitrot and hopefully turns cfbot green.