RE: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
Hi,

The WAL size for "SSD, full_page_writes=on" was 36GB. I currently don't have the exact size for the other rows because my test VMs got auto-deleted. I can possibly redo the benchmark to get pg_waldump stats for each row.

Best,
Talha.

-----Original Message-----
From: Stephen Frost
Sent: Sunday, August 30, 2020 3:24 PM
To: Tomas Vondra
Cc: Robert Haas; Andres Freund; Sait Talha Nisanci; Thomas Munro; Dmitry Dolgov <9erthali...@gmail.com>; David Steele; Alvaro Herrera; pgsql-hackers
Subject: Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote:
> > * Robert Haas (robertmh...@gmail.com) wrote:
> > > On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost wrote:
> > > > > Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same page, or if not in s_b anymore.
> > > >
> > > > We don't actually read the page when we're replaying an FPW though..? If we don't read it, and we entirely write the page from the FPW, how is pre-fetching helping..?
> > >
> > > Suppose there is a checkpoint. Then we replay a record with an FPW, pre-fetching nothing. Then the buffer gets evicted from shared_buffers, and maybe the OS cache too. Then, before the next checkpoint, we again replay a record for the same page. At this point, pre-fetching should be helpful.
> >
> > Sure- but if we're talking about 25GB of WAL, on a server that's got 32GB, then why would those pages end up getting evicted from memory entirely? Particularly, enough of them to end up with such a huge difference in replay time..
> >
> > I do agree that if we've got more outstanding WAL between checkpoints than the system's got memory then that certainly changes things, but that wasn't what I understood the case to be here.
>
> I don't think it's very clear how much WAL there actually was in each case - the message only said there was more than 25GB, but who knows how many checkpoints that covers? In the cases with FPW=on this may easily be much less than one checkpoint (because with a ~45GB scale an update to every page will log 45GB of full-page images). It'd be interesting to see some stats from pg_waldump etc.

Also in the message was this:

--
In order to avoid checkpoints during benchmark, max_wal_size (200GB) and checkpoint_timeout (200 mins) are set to a high value.
--

Which led me to suspect, at least, that this was much less than a checkpoint, as you suggest. Also, given that the comment was "run is cancelled when there is a reasonable amount of WAL (>25GB)", it seems likely that it's at least *around* there.

Ultimately though, there just isn't enough information provided to really be able to understand what's going on. I agree, pg_waldump stats would be useful.

> > > Admittedly, I don't quite understand whether that is what is happening in this test case, or why SSD vs. HDD should make any difference. But there doesn't seem to be any reason why it doesn't make sense in theory.
> >
> > I agree that this could be a reason, but it doesn't seem to quite fit in this particular case given the amount of memory and WAL. I'm suspecting that it's something else and I'd very much like to know if it's a general "this applies to all (most? a lot of?) SSDs because the hardware has a larger than 8KB page size and therefore the kernel has to read it", or if it's something odd about this particular system and doesn't apply generally.
>
> Not sure. I doubt it has anything to do with the hardware page size, that's mostly transparent to the kernel anyway. But it might be that the prefetching on a particular SSD has more overhead than what it saves.

Right- I wouldn't have thought the hardware page size would matter either, but it's entirely possible that assumption is wrong and that it does matter for some reason- perhaps with just some SSDs, or maybe with a lot of them, or maybe there's something else entirely going on.

About all I feel like I can say at the moment is that I'm very interested in ways to make WAL replay go faster and it'd be great to get more information about what's going on here to see if there's something we can do to generally improve WAL replay.

Thanks,

Stephen
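As a concrete way to gather the pg_waldump stats discussed above, here is a minimal sketch; the segment names are placeholders for whatever range a given benchmark run actually produced:

    # Summarize the WAL between two segments by record type:
    # record counts, record sizes, and full-page-image (FPI) bytes.
    pg_waldump --stats=record \
        $PGDATA/pg_wal/000000010000000000000001 \
        $PGDATA/pg_wal/000000010000000700000000

The FPI column in that output would settle how much of each >25GB run was full-page images versus ordinary records.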
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote:
> > * Robert Haas (robertmh...@gmail.com) wrote:
> > > On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost wrote:
> > > > > Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same page, or if not in s_b anymore.
> > > >
> > > > We don't actually read the page when we're replaying an FPW though..? If we don't read it, and we entirely write the page from the FPW, how is pre-fetching helping..?
> > >
> > > Suppose there is a checkpoint. Then we replay a record with an FPW, pre-fetching nothing. Then the buffer gets evicted from shared_buffers, and maybe the OS cache too. Then, before the next checkpoint, we again replay a record for the same page. At this point, pre-fetching should be helpful.
> >
> > Sure- but if we're talking about 25GB of WAL, on a server that's got 32GB, then why would those pages end up getting evicted from memory entirely? Particularly, enough of them to end up with such a huge difference in replay time..
> >
> > I do agree that if we've got more outstanding WAL between checkpoints than the system's got memory then that certainly changes things, but that wasn't what I understood the case to be here.
>
> I don't think it's very clear how much WAL there actually was in each case - the message only said there was more than 25GB, but who knows how many checkpoints that covers? In the cases with FPW=on this may easily be much less than one checkpoint (because with a ~45GB scale an update to every page will log 45GB of full-page images). It'd be interesting to see some stats from pg_waldump etc.

Also in the message was this:

--
In order to avoid checkpoints during benchmark, max_wal_size (200GB) and checkpoint_timeout (200 mins) are set to a high value.
--

Which led me to suspect, at least, that this was much less than a checkpoint, as you suggest. Also, given that the comment was "run is cancelled when there is a reasonable amount of WAL (>25GB)", it seems likely that it's at least *around* there.

Ultimately though, there just isn't enough information provided to really be able to understand what's going on. I agree, pg_waldump stats would be useful.

> > > Admittedly, I don't quite understand whether that is what is happening in this test case, or why SSD vs. HDD should make any difference. But there doesn't seem to be any reason why it doesn't make sense in theory.
> >
> > I agree that this could be a reason, but it doesn't seem to quite fit in this particular case given the amount of memory and WAL. I'm suspecting that it's something else and I'd very much like to know if it's a general "this applies to all (most? a lot of?) SSDs because the hardware has a larger than 8KB page size and therefore the kernel has to read it", or if it's something odd about this particular system and doesn't apply generally.
>
> Not sure. I doubt it has anything to do with the hardware page size, that's mostly transparent to the kernel anyway. But it might be that the prefetching on a particular SSD has more overhead than what it saves.

Right- I wouldn't have thought the hardware page size would matter either, but it's entirely possible that assumption is wrong and that it does matter for some reason- perhaps with just some SSDs, or maybe with a lot of them, or maybe there's something else entirely going on.

About all I feel like I can say at the moment is that I'm very interested in ways to make WAL replay go faster and it'd be great to get more information about what's going on here to see if there's something we can do to generally improve WAL replay.

Thanks,

Stephen
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote:
> Greetings,
>
> * Robert Haas (robertmh...@gmail.com) wrote:
> > On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost wrote:
> > > > Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same page, or if not in s_b anymore.
> > >
> > > We don't actually read the page when we're replaying an FPW though..? If we don't read it, and we entirely write the page from the FPW, how is pre-fetching helping..?
> >
> > Suppose there is a checkpoint. Then we replay a record with an FPW, pre-fetching nothing. Then the buffer gets evicted from shared_buffers, and maybe the OS cache too. Then, before the next checkpoint, we again replay a record for the same page. At this point, pre-fetching should be helpful.
>
> Sure- but if we're talking about 25GB of WAL, on a server that's got 32GB, then why would those pages end up getting evicted from memory entirely? Particularly, enough of them to end up with such a huge difference in replay time..
>
> I do agree that if we've got more outstanding WAL between checkpoints than the system's got memory then that certainly changes things, but that wasn't what I understood the case to be here.

I don't think it's very clear how much WAL there actually was in each case - the message only said there was more than 25GB, but who knows how many checkpoints that covers? In the cases with FPW=on this may easily be much less than one checkpoint (because with a ~45GB scale an update to every page will log 45GB of full-page images). It'd be interesting to see some stats from pg_waldump etc.

> Admittedly, I don't quite understand whether that is what is happening in this test case, or why SSD vs. HDD should make any difference. But there doesn't seem to be any reason why it doesn't make sense in theory.
>
> I agree that this could be a reason, but it doesn't seem to quite fit in this particular case given the amount of memory and WAL. I'm suspecting that it's something else and I'd very much like to know if it's a general "this applies to all (most? a lot of?) SSDs because the hardware has a larger than 8KB page size and therefore the kernel has to read it", or if it's something odd about this particular system and doesn't apply generally.

Not sure. I doubt it has anything to do with the hardware page size, that's mostly transparent to the kernel anyway. But it might be that the prefetching on a particular SSD has more overhead than what it saves.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
Hi Stephen,

OS version is Ubuntu 18.04.5 LTS.
Filesystem is ext4 and block size is 4KB.

Talha.

-----Original Message-----
From: Stephen Frost
Sent: Thursday, August 27, 2020 4:56 PM
To: Sait Talha Nisanci
Cc: Thomas Munro; Tomas Vondra; Dmitry Dolgov <9erthali...@gmail.com>; David Steele; Andres Freund; Alvaro Herrera; pgsql-hackers
Subject: Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

Greetings,

* Sait Talha Nisanci (sait.nisa...@microsoft.com) wrote:
> I have run some benchmarks for this patch. Overall it seems that there is a good improvement with the patch on recovery times:

Maybe I missed it somewhere, but what's the OS/filesystem being used here..? What's the filesystem block size..?

Thanks,

Stephen
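For anyone wanting to verify the same details on their own setup, a quick sketch; the device and mount point below are placeholders:

    # ext4 block size, read from the superblock
    sudo tune2fs -l /dev/sda1 | grep -i '^block size'

    # or, for any mounted filesystem, the fundamental block size
    stat -f --format='%S' /var/lib/postgresql/data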
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost wrote:
> > > Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same page, or if not in s_b anymore.
> >
> > We don't actually read the page when we're replaying an FPW though..? If we don't read it, and we entirely write the page from the FPW, how is pre-fetching helping..?
>
> Suppose there is a checkpoint. Then we replay a record with an FPW, pre-fetching nothing. Then the buffer gets evicted from shared_buffers, and maybe the OS cache too. Then, before the next checkpoint, we again replay a record for the same page. At this point, pre-fetching should be helpful.

Sure- but if we're talking about 25GB of WAL, on a server that's got 32GB, then why would those pages end up getting evicted from memory entirely? Particularly, enough of them to end up with such a huge difference in replay time..

I do agree that if we've got more outstanding WAL between checkpoints than the system's got memory then that certainly changes things, but that wasn't what I understood the case to be here.

> Admittedly, I don't quite understand whether that is what is happening in this test case, or why SSD vs. HDD should make any difference. But there doesn't seem to be any reason why it doesn't make sense in theory.

I agree that this could be a reason, but it doesn't seem to quite fit in this particular case given the amount of memory and WAL. I'm suspecting that it's something else and I'd very much like to know if it's a general "this applies to all (most? a lot of?) SSDs because the hardware has a larger than 8KB page size and therefore the kernel has to read it", or if it's something odd about this particular system and doesn't apply generally.

Thanks,

Stephen
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost wrote:
> > Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same page, or if not in s_b anymore.
>
> We don't actually read the page when we're replaying an FPW though..? If we don't read it, and we entirely write the page from the FPW, how is pre-fetching helping..?

Suppose there is a checkpoint. Then we replay a record with an FPW, pre-fetching nothing. Then the buffer gets evicted from shared_buffers, and maybe the OS cache too. Then, before the next checkpoint, we again replay a record for the same page. At this point, pre-fetching should be helpful.

Admittedly, I don't quite understand whether that is what is happening in this test case, or why SSD vs. HDD should make any difference. But there doesn't seem to be any reason why it doesn't make sense in theory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On August 27, 2020 11:26:42 AM PDT, Stephen Frost wrote:
> > Is WAL FPW compression enabled..? I'm trying to figure out how, given what's been shared here, that replaying 25GB of WAL is being helped out by 2.5x thanks to prefetch in the SSD case. That prefetch is hurting in the HDD case entirely makes sense to me- we're spending time reading pages from the HDD, which is entirely pointless work given that we're just going to write over those pages entirely with FPWs.
>
> Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same page, or if not in s_b anymore.

We don't actually read the page when we're replaying an FPW though..? If we don't read it, and we entirely write the page from the FPW, how is pre-fetching helping..?

I understood how it could be helpful for filesystems which have a larger block size than ours (eg: zfs w/ 16kb block sizes, where the kernel needs to get the whole 16kb block when we only write 8kb to it), but that's apparently not the case here.

So- what is it that pre-fetching is doing to result in such an improvement? Is there something lower level where the SSD physical block size is coming into play, which is typically larger..? I wouldn't have thought so, but perhaps that's the case..

Thanks,

Stephen
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
Hi,

On August 27, 2020 11:26:42 AM PDT, Stephen Frost wrote:
> Is WAL FPW compression enabled..? I'm trying to figure out how, given what's been shared here, that replaying 25GB of WAL is being helped out by 2.5x thanks to prefetch in the SSD case. That prefetch is hurting in the HDD case entirely makes sense to me- we're spending time reading pages from the HDD, which is entirely pointless work given that we're just going to write over those pages entirely with FPWs.

Hm? At least earlier versions didn't do prefetching for records with an fpw, and only for subsequent records affecting the same page, or if not in s_b anymore.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
Greetings,

* Sait Talha Nisanci (sait.nisa...@microsoft.com) wrote:
> OS version is Ubuntu 18.04.5 LTS.
> Filesystem is ext4 and block size is 4KB.

[...]

* Sait Talha Nisanci (sait.nisa...@microsoft.com) wrote:
> I have run some benchmarks for this patch. Overall it seems that there is a good improvement with the patch on recovery times:
>
> The VMs I used have 32GB RAM; pgbench is initialized with a scale factor of 3000 (so it doesn't fit in memory, ~45GB).
>
> In order to avoid checkpoints during the benchmark, max_wal_size (200GB) and checkpoint_timeout (200 mins) are set to high values.
>
> The run is cancelled when there is a reasonable amount of WAL (> 25GB). The recovery times are measured from the REDO logs.
>
> I have tried combinations of SSD, HDD, full_page_writes = on/off and max_io_concurrency = 10/50; the recovery times are as follows (in seconds):
>
>                               No prefetch | Default prefetch values | Default + max_io_concurrency = 50
> SSD, full_page_writes = on        852               301                          197
> SSD, full_page_writes = off      1642              1359                         1391
> HDD, full_page_writes = on       6027              6345                         6390
> HDD, full_page_writes = off       738               275                          192
>
> Default prefetch values:
> - max_recovery_prefetch_distance = 256KB
> - max_io_concurrency = 10
>
> It probably makes sense to compare each row separately as the size of WAL can be different.

Is WAL FPW compression enabled..? I'm trying to figure out how, given what's been shared here, that replaying 25GB of WAL is being helped out by 2.5x thanks to prefetch in the SSD case. That prefetch is hurting in the HDD case entirely makes sense to me- we're spending time reading pages from the HDD, which is entirely pointless work given that we're just going to write over those pages entirely with FPWs.

Further, if there's 32GB of RAM, and WAL compression isn't enabled and the WAL is only 25GB, then it's very likely that every page touched by the WAL ends up in memory (shared buffers or fs cache), and with FPWs we shouldn't ever need to actually read from the storage to get those pages, right? So how is prefetch helping so much..?

I'm not sure that the 'full_page_writes = off' tests are very interesting in this case, since you're going to get torn pages and therefore corruption, and hopefully no one is running with that configuration on this OS/filesystem.

Thanks,

Stephen
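For completeness, the settings Stephen is asking about can be read directly off the benchmark instance; a trivial sketch, assuming a psql connection to it:

    # wal_compression defaults to off, full_page_writes to on
    psql -Atc 'SHOW wal_compression'
    psql -Atc 'SHOW full_page_writes'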
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
Greetings,

* Sait Talha Nisanci (sait.nisa...@microsoft.com) wrote:
> I have run some benchmarks for this patch. Overall it seems that there is a good improvement with the patch on recovery times:

Maybe I missed it somewhere, but what's the OS/filesystem being used here..? What's the filesystem block size..?

Thanks,

Stephen
Re: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
On Wed, Aug 26, 2020 at 9:42 AM Sait Talha Nisanci wrote:
> I have tried combinations of SSD, HDD, full_page_writes = on/off and max_io_concurrency = 10/50; the recovery times are as follows (in seconds):
>
>                               No prefetch | Default prefetch values | Default + max_io_concurrency = 50
> SSD, full_page_writes = on        852               301                          197
> SSD, full_page_writes = off      1642              1359                         1391
> HDD, full_page_writes = on       6027              6345                         6390
> HDD, full_page_writes = off       738               275                          192

The regression on HDD with full_page_writes=on is interesting. I don't know why that should happen, and I wonder if there is anything that can be done to mitigate it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
RE: [EXTERNAL] Re: WIP: WAL prefetch (another approach)
I have run some benchmarks for this patch. Overall it seems that there is a good improvement with the patch on recovery times:

The VMs I used have 32GB RAM; pgbench is initialized with a scale factor of 3000 (so it doesn't fit in memory, ~45GB).

In order to avoid checkpoints during the benchmark, max_wal_size (200GB) and checkpoint_timeout (200 mins) are set to high values.

The run is cancelled when there is a reasonable amount of WAL (> 25GB). The recovery times are measured from the REDO logs.

I have tried combinations of SSD, HDD, full_page_writes = on/off and max_io_concurrency = 10/50; the recovery times are as follows (in seconds):

                              No prefetch | Default prefetch values | Default + max_io_concurrency = 50
SSD, full_page_writes = on        852               301                          197
SSD, full_page_writes = off      1642              1359                         1391
HDD, full_page_writes = on       6027              6345                         6390
HDD, full_page_writes = off       738               275                          192

Default prefetch values:
- max_recovery_prefetch_distance = 256KB
- max_io_concurrency = 10

It probably makes sense to compare each row separately as the size of WAL can be different.

Talha.

-----Original Message-----
From: Thomas Munro
Sent: Thursday, August 13, 2020 9:57 AM
To: Tomas Vondra
Cc: Stephen Frost; Dmitry Dolgov <9erthali...@gmail.com>; David Steele; Andres Freund; Alvaro Herrera; pgsql-hackers
Subject: [EXTERNAL] Re: WIP: WAL prefetch (another approach)

On Thu, Aug 6, 2020 at 10:47 PM Tomas Vondra wrote:
> On Thu, Aug 06, 2020 at 02:58:44PM +1200, Thomas Munro wrote:
> > On Tue, Aug 4, 2020 at 3:47 AM Tomas Vondra wrote:
> > > Any luck trying to reproduce things? Should I try again and collect some additional debug info?
> >
> > No luck. I'm working on it now, and also trying to reduce the overheads so that we're not doing extra work when it doesn't help.
>
> OK, I'll see if I can still reproduce it.

Since someone else asked me off-list, here's a rebase, with no functional changes. Soon I'll post a new improved version, but this version just fixes the bitrot and hopefully turns cfbot green.
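For anyone trying to reproduce this setup, a minimal sketch assembled from the details in the benchmark message above; the database name is a placeholder, and the last two GUCs come from the WIP patch, not stock PostgreSQL:

    # ~45GB of pgbench data, deliberately larger than the 32GB of RAM
    pgbench -i -s 3000 postgres

    # postgresql.conf settings, per the message above:
    #   max_wal_size = 200GB        # keep checkpoints out of the benchmark window
    #   checkpoint_timeout = 200min
    #   max_recovery_prefetch_distance = 256KB   # patch GUC
    #   max_io_concurrency = 10                  # patch GUC; 50 in the third column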