Re: zstd compression for pg_dump
Justin Pryzby writes:
> On Thu, Apr 06, 2023 at 05:34:30PM +0200, Tomas Vondra wrote:
>> I think that's all for PG16 in this patch series. If there's more we
>> want to do, it'll have to wait for PG17 -
> Yes

Shouldn't the CF entry be closed as committed?  It's certainly making
the cfbot unhappy because the patch-of-record doesn't apply.

regards, tom lane
Re: zstd compression for pg_dump
On Thu, Apr 06, 2023 at 05:34:30PM +0200, Tomas Vondra wrote:
> I looked at the long mode patch again, updated the commit message and
> pushed it. I was wondering if long_mode should really be bool -
> logically it is, but ZSTD_CCtx_setParameter() expects int. But I think
> that's fine.

Thanks!

> I think that's all for PG16 in this patch series. If there's more we
> want to do, it'll have to wait for PG17 -

Yes

> Justin, can you update and submit the patches that you think are
> relevant for the next CF?

Yeah.  It sounds like a shiny new feature, but it's not totally clear if
it's safe here or even how useful it is.  (It might be like my patch for
wal_compression=zstd:level, and Michael's for toast_compression=zstd,
neither of which saw any support).

Last year's basebackup thread had some interesting comments about safety
of threads, although pg_dump's considerations may be different.  The
patch itself is trivial, so it'd be fine to wait until PG16 is released
to get some experience.  If someone else wanted to do that, it'd be fine
with me.

-- 
Justin
Re: zstd compression for pg_dump
On 4/5/23 21:42, Tomas Vondra wrote:
> On 4/4/23 05:04, Justin Pryzby wrote:
>> On Mon, Apr 03, 2023 at 11:26:09PM +0200, Tomas Vondra wrote:
>>> On 4/3/23 21:17, Justin Pryzby wrote:
>>>> On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
>>>>>> Feel free to mess around with threads (but I'd much rather see
>>>>>> the patch progress for zstd:long).
>>>>>
>>>>> OK, understood. The long mode patch is pretty simple. IIUC it does
>>>>> not change the format, i.e. in the worst case we could leave it
>>>>> for PG17 too. Correct?
>>>>
>>>> Right, libzstd only has one "format", which is the same as what's
>>>> used by the commandline tool.  zstd:long doesn't change the format
>>>> of the output: the library just uses a larger memory buffer to
>>>> allow better compression.  There's no format change for
>>>> zstd:workers, either.
>>>
>>> OK. I plan to do a bit more review/testing on this, and get it
>>> committed over the next day or two, likely including the long mode.
>>> One thing I noticed today is that maybe long_distance should be a
>>> bool, not int.  Yes, ZSTD_c_enableLongDistanceMatching() accepts
>>> int, but it'd be cleaner to cast the value during a call and keep it
>>> bool otherwise.
>>
>> Thanks for noticing.  Evidently I wrote it using "int" to get the
>> feature working, and then later wrote the bool parsing bits but never
>> changed the data structure.
>>
>> This also updates a few comments, indentation, removes a useless
>> assertion, and updates the warning about zstd:workers.
>
> Thanks. I've cleaned up the 0001 a little bit (a couple comment
> improvements), updated the commit message and pushed it. I plan to
> take care of the 0002 (long distance mode) tomorrow, and that'll be it
> for PG16 I think.

I looked at the long mode patch again, updated the commit message and
pushed it. I was wondering if long_mode should really be bool -
logically it is, but ZSTD_CCtx_setParameter() expects int. But I think
that's fine.

I think that's all for PG16 in this patch series. If there's more we
want to do, it'll have to wait for PG17 - Justin, can you update and
submit the patches that you think are relevant for the next CF?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: zstd compression for pg_dump
On 4/4/23 05:04, Justin Pryzby wrote:
> On Mon, Apr 03, 2023 at 11:26:09PM +0200, Tomas Vondra wrote:
>> On 4/3/23 21:17, Justin Pryzby wrote:
>>> On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
>>>>> Feel free to mess around with threads (but I'd much rather see the
>>>>> patch progress for zstd:long).
>>>>
>>>> OK, understood. The long mode patch is pretty simple. IIUC it does
>>>> not change the format, i.e. in the worst case we could leave it for
>>>> PG17 too. Correct?
>>>
>>> Right, libzstd only has one "format", which is the same as what's
>>> used by the commandline tool.  zstd:long doesn't change the format
>>> of the output: the library just uses a larger memory buffer to allow
>>> better compression.  There's no format change for zstd:workers,
>>> either.
>>
>> OK. I plan to do a bit more review/testing on this, and get it
>> committed over the next day or two, likely including the long mode.
>> One thing I noticed today is that maybe long_distance should be a
>> bool, not int.  Yes, ZSTD_c_enableLongDistanceMatching() accepts int,
>> but it'd be cleaner to cast the value during a call and keep it bool
>> otherwise.
>
> Thanks for noticing.  Evidently I wrote it using "int" to get the
> feature working, and then later wrote the bool parsing bits but never
> changed the data structure.
>
> This also updates a few comments, indentation, removes a useless
> assertion, and updates the warning about zstd:workers.

Thanks. I've cleaned up the 0001 a little bit (a couple comment
improvements), updated the commit message and pushed it. I plan to take
care of the 0002 (long distance mode) tomorrow, and that'll be it for
PG16 I think.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: zstd compression for pg_dump
On Mon, Apr 03, 2023 at 11:26:09PM +0200, Tomas Vondra wrote:
> On 4/3/23 21:17, Justin Pryzby wrote:
>> On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
>>>> Feel free to mess around with threads (but I'd much rather see the
>>>> patch progress for zstd:long).
>>>
>>> OK, understood. The long mode patch is pretty simple. IIUC it does
>>> not change the format, i.e. in the worst case we could leave it for
>>> PG17 too. Correct?
>>
>> Right, libzstd only has one "format", which is the same as what's
>> used by the commandline tool.  zstd:long doesn't change the format of
>> the output: the library just uses a larger memory buffer to allow
>> better compression.  There's no format change for zstd:workers,
>> either.
>
> OK. I plan to do a bit more review/testing on this, and get it
> committed over the next day or two, likely including the long mode.
> One thing I noticed today is that maybe long_distance should be a
> bool, not int.  Yes, ZSTD_c_enableLongDistanceMatching() accepts int,
> but it'd be cleaner to cast the value during a call and keep it bool
> otherwise.

Thanks for noticing.  Evidently I wrote it using "int" to get the
feature working, and then later wrote the bool parsing bits but never
changed the data structure.

This also updates a few comments, indentation, removes a useless
assertion, and updates the warning about zstd:workers.
-- 
Justin

>From df0eb4d3c4799f24e58f1e5b0a9470e5af355ad6 Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Sat, 7 Jan 2023 15:45:06 -0600
Subject: [PATCH 1/4] pg_dump: zstd compression

Previously proposed at: 20201221194924.gi30...@telsasoft.com
---
 doc/src/sgml/ref/pg_dump.sgml         |  13 +-
 src/bin/pg_dump/Makefile              |   2 +
 src/bin/pg_dump/compress_io.c         |  66 ++--
 src/bin/pg_dump/compress_zstd.c       | 537 ++++++++++++++++++++++++++
 src/bin/pg_dump/compress_zstd.h       |  25 ++
 src/bin/pg_dump/meson.build           |   4 +-
 src/bin/pg_dump/pg_backup_archiver.c  |   9 +-
 src/bin/pg_dump/pg_backup_directory.c |   2 +
 src/bin/pg_dump/pg_dump.c             |  20 +-
 src/bin/pg_dump/t/002_pg_dump.pl      |  79 +++-
 src/tools/pginclude/cpluspluscheck    |   1 +
 src/tools/pgindent/typedefs.list      |   1 +
 12 files changed, 705 insertions(+), 54 deletions(-)
 create mode 100644 src/bin/pg_dump/compress_zstd.c
 create mode 100644 src/bin/pg_dump/compress_zstd.h

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 77299878e02..8de38e0fd0d 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -330,8 +330,9 @@ PostgreSQL documentation
        machine-readable format that pg_restore can read. A directory
        format archive can be manipulated with standard Unix tools; for
        example, files in an uncompressed archive
-       can be compressed with the gzip or
-       lz4 tools.
+       can be compressed with the gzip,
+       lz4, or
+       zstd tools.
        This format is compressed by default using gzip
        and also supports parallel dumps.
@@ -655,7 +656,8 @@ PostgreSQL documentation
         Specify the compression method and/or the compression level to use.
         The compression method can be set to gzip,
-        lz4, or none for no compression.
+        lz4, zstd,
+        or none for no compression.
         A compression detail string can optionally be specified.  If the
         detail string is an integer, it specifies the compression level.
         Otherwise, it should be a comma-separated list of items, each of the
@@ -676,8 +678,9 @@ PostgreSQL documentation
         individual table-data segments, and the default is to compress using
         gzip at a moderate level. For plain text output,
         setting a nonzero compression level causes the entire output file to
         be compressed,
-        as though it had been fed through gzip or
-        lz4; but the default is not to compress.
+        as though it had been fed through gzip,
+        lz4, or zstd;
+        but the default is not to compress.
         The tar archive format currently does not support compression at
         all.

diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile
index eb8f59459a1..24de7593a6a 100644
--- a/src/bin/pg_dump/Makefile
+++ b/src/bin/pg_dump/Makefile
@@ -18,6 +18,7 @@ include $(top_builddir)/src/Makefile.global
 export GZIP_PROGRAM=$(GZIP)
 export LZ4
+export ZSTD
 export with_icu

 override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
@@ -29,6 +30,7 @@ OBJS = \
 	compress_io.o \
 	compress_lz4.o \
 	compress_none.o \
+	compress_zstd.o \
 	dumputils.o \
 	parallel.o \
 	pg_backup_archiver.o \

diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c
index 0972a4f934a..4f06bb024f9 100644
--- a/src/bin/pg_dump/compress_io.c
+++ b/src/bin/pg_dump/compress_io.c
@@ -52,8 +52,8 @@
  *
  *	InitDiscoverCompressFileHandle tries to infer the
Re: zstd compression for pg_dump
On 4/3/23 21:17, Justin Pryzby wrote:
> On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
>>> Feel free to mess around with threads (but I'd much rather see the
>>> patch progress for zstd:long).
>>
>> OK, understood. The long mode patch is pretty simple. IIUC it does
>> not change the format, i.e. in the worst case we could leave it for
>> PG17 too. Correct?
>
> Right, libzstd only has one "format", which is the same as what's used
> by the commandline tool.  zstd:long doesn't change the format of the
> output: the library just uses a larger memory buffer to allow better
> compression.  There's no format change for zstd:workers, either.

OK. I plan to do a bit more review/testing on this, and get it committed
over the next day or two, likely including the long mode. One thing I
noticed today is that maybe long_distance should be a bool, not int.
Yes, ZSTD_c_enableLongDistanceMatching() accepts int, but it'd be
cleaner to cast the value during a call and keep it bool otherwise.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: zstd compression for pg_dump
On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
>> Feel free to mess around with threads (but I'd much rather see the
>> patch progress for zstd:long).
>
> OK, understood. The long mode patch is pretty simple. IIUC it does not
> change the format, i.e. in the worst case we could leave it for PG17
> too. Correct?

Right, libzstd only has one "format", which is the same as what's used
by the commandline tool.  zstd:long doesn't change the format of the
output: the library just uses a larger memory buffer to allow better
compression.  There's no format change for zstd:workers, either.

-- 
Justin
Re: zstd compression for pg_dump
On 4/1/23 15:36, Justin Pryzby wrote:
>
> ...
>
>> If there are no concerns, why disable it outside Windows? I don't
>> have a good idea how beneficial the multi-threaded compression is, so
>> I can't quite judge the risk/benefits tradeoff.
>
> Because it's a minor/fringe feature, and it's annoying to have
> platform differences (would we plan on relaxing the restriction in
> v17, or is it more likely we'd forget?).
>
> I realized how little I've tested with zstd workers myself.  And I
> think on cirrusci, the macos and freebsd tasks have zstd libraries
> with threading support, but it wasn't being exercised (because using
> :workers would cause the patch to fail unless it's supported
> everywhere).  So I updated the "for CI only" patch to 1) use meson
> wraps to compile the zstd library with threading on linux and windows;
> and, 2) use zstd:workers=3 "opportunistically" (but avoid failing if
> threads are not supported, since the autoconf task still doesn't have
> access to a library with thread support).  That's a great step, but it
> still seems bad that the thread stuff has been little exercised until
> now.  (Also, the windows task failed; I think that's due to a
> transient network issue.)

Agreed, let's leave the threading for PG17, depending on how beneficial
it turns out to be for pg_dump.

> Feel free to mess around with threads (but I'd much rather see the
> patch progress for zstd:long).

OK, understood. The long mode patch is pretty simple. IIUC it does not
change the format, i.e. in the worst case we could leave it for PG17
too. Correct?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: zstd compression for pg_dump
On Sat, Apr 01, 2023 at 02:49:44PM +0200, Tomas Vondra wrote:
> On 4/1/23 02:28, Justin Pryzby wrote:
>> On Sat, Apr 01, 2023 at 02:11:12AM +0200, Tomas Vondra wrote:
>>> On 4/1/23 01:16, Justin Pryzby wrote:
>>>> On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
>>>>> On 3/27/23 19:28, Justin Pryzby wrote:
>>>>>> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>>>>>>> On 3/16/23 05:50, Justin Pryzby wrote:
>>>>>>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>>>>>>>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion wrote:
>>>>>>>>>> I did some smoke testing against zstd's GitHub release on
>>>>>>>>>> Windows. To build against it, I had to construct an import
>>>>>>>>>> library, and put that and the DLL into the `lib` folder
>>>>>>>>>> expected by the MSVC scripts... which makes me wonder if I've
>>>>>>>>>> chosen a harder way than necessary?
>>>>>>>>>
>>>>>>>>> It looks like pg_dump's meson.build is missing dependencies on
>>>>>>>>> zstd (meson couldn't find the headers in the subproject
>>>>>>>>> without them).
>>>>>>>>
>>>>>>>> I saw that this was added for LZ4, but I hadn't added it for
>>>>>>>> zstd since I didn't run into an issue without it. Could you
>>>>>>>> check that what I've added works for your case?
>>>>>>>>
>>>>>>>>>> Parallel zstd dumps seem to work as expected, in that the
>>>>>>>>>> resulting pg_restore output is identical to uncompressed
>>>>>>>>>> dumps and nothing explodes. I haven't inspected the threading
>>>>>>>>>> implementation for safety yet, as you mentioned.
>>>>>>>>>
>>>>>>>>> Hm. Best I can tell, the CloneArchive() machinery is supposed
>>>>>>>>> to be handling safety for this, by isolating each thread's
>>>>>>>>> state. I don't feel comfortable pronouncing this new addition
>>>>>>>>> safe or not, because I'm not sure I understand what the
>>>>>>>>> comments in the format-specific _Clone() callbacks are saying
>>>>>>>>> yet.
>>>>>>>>
>>>>>>>> My line of reasoning for unix is that pg_dump forks before any
>>>>>>>> calls to zstd. Nothing zstd does ought to affect the pg_dump
>>>>>>>> layer. But that doesn't apply to pg_dump under windows. This is
>>>>>>>> an open question. If there's no solid answer, I could
>>>>>>>> disable/ignore the option (maybe only under windows).
>>>>>>>
>>>>>>> I may be missing something, but why would the patch affect this?
>>>>>>> Why would it even affect safety of the parallel dump? And I
>>>>>>> don't see any changes to the clone stuff ...
>>>>>>
>>>>>> zstd supports using threads during compression, with -Z
>>>>>> zstd:workers=N. When unix forks, the child processes can't do
>>>>>> anything to mess up the state of the parent processes.
>>>>>>
>>>>>> But windows pg_dump uses threads instead of forking, so it seems
>>>>>> possible that the pg_dump -j threads that then spawn zstd threads
>>>>>> could "leak threads" and break the main thread. I suspect there's
>>>>>> no issue, but we still ought to verify that before declaring it
>>>>>> safe.
>>>>>
>>>>> OK. I don't have access to a Windows machine so I can't test that.
>>>>> Is it possible to disable the zstd threading, until we figure this
>>>>> out?
>>>>
>>>> I think that's what's best. I made it issue a warning if "workers"
>>>> was specified. It could also be an error, or just ignored.
>>>>
>>>> I considered disabling workers only for windows, but realized that
>>>> I haven't tested with threads myself - my local zstd package is
>>>> compiled without threading, and I remember having some issue
>>>> recompiling it with threading. Jacob's recipe for using meson wraps
>>>> works well, but it still seems better to leave it as a future
>>>> feature. I used that recipe to enable zstd with threading on CI
>>>> (except for linux/autoconf).
>>>
>>> +1 to disable this if we're unsure it works correctly. I agree it's
>>> better to just error out if workers are requested - I rather dislike
>>> when a tool just ignores an explicit parameter. And AFAICS it's what
>>> zstd does too, when someone requests workers on an incompatible
>>> build.
>>>
>>> FWIW I've been thinking about this a bit more and I don't quite see
>>> why the threading would cause issues (except for Windows). I forgot
>>> pg_basebackup already supports zstd, including the worker threading,
>>> so why would it work there and not in pg_dump? Sure, pg_basebackup
>>> is not parallel, but with separate pg_dump processes that shouldn't
>>> be an issue (although I'm not sure when zstd creates threads).
>>
>> There's no concern at all except under windows (because on windows
>> pg_dump -j is implemented using threads rather than forking).
>> Especially since zstd:workers is already allowed in the basebackup
>> backend process.
>
> If there are no concerns, why disable it outside Windows? I don't have
> a good idea how beneficial the multi-threaded compression is, so I
Re: zstd compression for pg_dump
On 4/1/23 02:28, Justin Pryzby wrote:
> On Sat, Apr 01, 2023 at 02:11:12AM +0200, Tomas Vondra wrote:
>> On 4/1/23 01:16, Justin Pryzby wrote:
>>> On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
>>>> On 3/27/23 19:28, Justin Pryzby wrote:
>>>>> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>>>>>> On 3/16/23 05:50, Justin Pryzby wrote:
>>>>>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>>>>>>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion wrote:
>>>>>>>>> I did some smoke testing against zstd's GitHub release on
>>>>>>>>> Windows. To build against it, I had to construct an import
>>>>>>>>> library, and put that and the DLL into the `lib` folder
>>>>>>>>> expected by the MSVC scripts... which makes me wonder if I've
>>>>>>>>> chosen a harder way than necessary?
>>>>>>>>
>>>>>>>> It looks like pg_dump's meson.build is missing dependencies on
>>>>>>>> zstd (meson couldn't find the headers in the subproject without
>>>>>>>> them).
>>>>>>>
>>>>>>> I saw that this was added for LZ4, but I hadn't added it for
>>>>>>> zstd since I didn't run into an issue without it. Could you
>>>>>>> check that what I've added works for your case?
>>>>>>>
>>>>>>>>> Parallel zstd dumps seem to work as expected, in that the
>>>>>>>>> resulting pg_restore output is identical to uncompressed dumps
>>>>>>>>> and nothing explodes. I haven't inspected the threading
>>>>>>>>> implementation for safety yet, as you mentioned.
>>>>>>>>
>>>>>>>> Hm. Best I can tell, the CloneArchive() machinery is supposed
>>>>>>>> to be handling safety for this, by isolating each thread's
>>>>>>>> state. I don't feel comfortable pronouncing this new addition
>>>>>>>> safe or not, because I'm not sure I understand what the
>>>>>>>> comments in the format-specific _Clone() callbacks are saying
>>>>>>>> yet.
>>>>>>>
>>>>>>> My line of reasoning for unix is that pg_dump forks before any
>>>>>>> calls to zstd. Nothing zstd does ought to affect the pg_dump
>>>>>>> layer. But that doesn't apply to pg_dump under windows. This is
>>>>>>> an open question. If there's no solid answer, I could
>>>>>>> disable/ignore the option (maybe only under windows).
>>>>>>
>>>>>> I may be missing something, but why would the patch affect this?
>>>>>> Why would it even affect safety of the parallel dump? And I don't
>>>>>> see any changes to the clone stuff ...
>>>>>
>>>>> zstd supports using threads during compression, with -Z
>>>>> zstd:workers=N. When unix forks, the child processes can't do
>>>>> anything to mess up the state of the parent processes.
>>>>>
>>>>> But windows pg_dump uses threads instead of forking, so it seems
>>>>> possible that the pg_dump -j threads that then spawn zstd threads
>>>>> could "leak threads" and break the main thread. I suspect there's
>>>>> no issue, but we still ought to verify that before declaring it
>>>>> safe.
>>>>
>>>> OK. I don't have access to a Windows machine so I can't test that.
>>>> Is it possible to disable the zstd threading, until we figure this
>>>> out?
>>>
>>> I think that's what's best. I made it issue a warning if "workers"
>>> was specified. It could also be an error, or just ignored.
>>>
>>> I considered disabling workers only for windows, but realized that I
>>> haven't tested with threads myself - my local zstd package is
>>> compiled without threading, and I remember having some issue
>>> recompiling it with threading. Jacob's recipe for using meson wraps
>>> works well, but it still seems better to leave it as a future
>>> feature. I used that recipe to enable zstd with threading on CI
>>> (except for linux/autoconf).
>>
>> +1 to disable this if we're unsure it works correctly. I agree it's
>> better to just error out if workers are requested - I rather dislike
>> when a tool just ignores an explicit parameter. And AFAICS it's what
>> zstd does too, when someone requests workers on an incompatible
>> build.
>>
>> FWIW I've been thinking about this a bit more and I don't quite see
>> why the threading would cause issues (except for Windows). I forgot
>> pg_basebackup already supports zstd, including the worker threading,
>> so why would it work there and not in pg_dump? Sure, pg_basebackup is
>> not parallel, but with separate pg_dump processes that shouldn't be
>> an issue (although I'm not sure when zstd creates threads).
>
> There's no concern at all except under windows (because on windows
> pg_dump -j is implemented using threads rather than forking).
> Especially since zstd:workers is already allowed in the basebackup
> backend process.

If there are no concerns, why disable it outside Windows? I don't have a
good idea how beneficial the multi-threaded compression is, so I can't
quite judge the risk/benefits tradeoff.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: zstd compression for pg_dump
On Sat, Apr 01, 2023 at 02:11:12AM +0200, Tomas Vondra wrote:
> On 4/1/23 01:16, Justin Pryzby wrote:
>> On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
>>> On 3/27/23 19:28, Justin Pryzby wrote:
>>>> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>>>>> On 3/16/23 05:50, Justin Pryzby wrote:
>>>>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>>>>>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion wrote:
>>>>>>>> I did some smoke testing against zstd's GitHub release on
>>>>>>>> Windows. To build against it, I had to construct an import
>>>>>>>> library, and put that and the DLL into the `lib` folder
>>>>>>>> expected by the MSVC scripts... which makes me wonder if I've
>>>>>>>> chosen a harder way than necessary?
>>>>>>>
>>>>>>> It looks like pg_dump's meson.build is missing dependencies on
>>>>>>> zstd (meson couldn't find the headers in the subproject without
>>>>>>> them).
>>>>>>
>>>>>> I saw that this was added for LZ4, but I hadn't added it for zstd
>>>>>> since I didn't run into an issue without it. Could you check that
>>>>>> what I've added works for your case?
>>>>>>
>>>>>>>> Parallel zstd dumps seem to work as expected, in that the
>>>>>>>> resulting pg_restore output is identical to uncompressed dumps
>>>>>>>> and nothing explodes. I haven't inspected the threading
>>>>>>>> implementation for safety yet, as you mentioned.
>>>>>>>
>>>>>>> Hm. Best I can tell, the CloneArchive() machinery is supposed to
>>>>>>> be handling safety for this, by isolating each thread's state. I
>>>>>>> don't feel comfortable pronouncing this new addition safe or
>>>>>>> not, because I'm not sure I understand what the comments in the
>>>>>>> format-specific _Clone() callbacks are saying yet.
>>>>>>
>>>>>> My line of reasoning for unix is that pg_dump forks before any
>>>>>> calls to zstd. Nothing zstd does ought to affect the pg_dump
>>>>>> layer. But that doesn't apply to pg_dump under windows. This is
>>>>>> an open question. If there's no solid answer, I could
>>>>>> disable/ignore the option (maybe only under windows).
>>>>>
>>>>> I may be missing something, but why would the patch affect this?
>>>>> Why would it even affect safety of the parallel dump? And I don't
>>>>> see any changes to the clone stuff ...
>>>>
>>>> zstd supports using threads during compression, with -Z
>>>> zstd:workers=N. When unix forks, the child processes can't do
>>>> anything to mess up the state of the parent processes.
>>>>
>>>> But windows pg_dump uses threads instead of forking, so it seems
>>>> possible that the pg_dump -j threads that then spawn zstd threads
>>>> could "leak threads" and break the main thread. I suspect there's
>>>> no issue, but we still ought to verify that before declaring it
>>>> safe.
>>>
>>> OK. I don't have access to a Windows machine so I can't test that.
>>> Is it possible to disable the zstd threading, until we figure this
>>> out?
>>
>> I think that's what's best. I made it issue a warning if "workers"
>> was specified. It could also be an error, or just ignored.
>>
>> I considered disabling workers only for windows, but realized that I
>> haven't tested with threads myself - my local zstd package is
>> compiled without threading, and I remember having some issue
>> recompiling it with threading. Jacob's recipe for using meson wraps
>> works well, but it still seems better to leave it as a future
>> feature. I used that recipe to enable zstd with threading on CI
>> (except for linux/autoconf).
>
> +1 to disable this if we're unsure it works correctly. I agree it's
> better to just error out if workers are requested - I rather dislike
> when a tool just ignores an explicit parameter. And AFAICS it's what
> zstd does too, when someone requests workers on an incompatible build.
>
> FWIW I've been thinking about this a bit more and I don't quite see
> why the threading would cause issues (except for Windows). I forgot
> pg_basebackup already supports zstd, including the worker threading,
> so why would it work there and not in pg_dump? Sure, pg_basebackup is
> not parallel, but with separate pg_dump processes that shouldn't be an
> issue (although I'm not sure when zstd creates threads).

There's no concern at all except under windows (because on windows
pg_dump -j is implemented using threads rather than forking).
Especially since zstd:workers is already allowed in the basebackup
backend process.

> I'll try building zstd with threading enabled, and do some tests over
> the weekend.

Feel free to wait until v17 :)

I used "meson wraps" to get a local version with threading.  Note that
if you want to use a zstd subproject, you may have to specify
-D zstd=enabled, or else meson may not enable the library at all.
Also, in order to introspect its settings, I had to do like this:

mkdir subprojects
meson wrap install zstd
meson subprojects download
mkdir build.meson
meson setup -C build.meson
Re: zstd compression for pg_dump
On 4/1/23 01:16, Justin Pryzby wrote:
> On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
>> On 3/27/23 19:28, Justin Pryzby wrote:
>>> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>>>> On 3/16/23 05:50, Justin Pryzby wrote:
>>>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>>>>>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion wrote:
>>>>>>> I did some smoke testing against zstd's GitHub release on
>>>>>>> Windows. To build against it, I had to construct an import
>>>>>>> library, and put that and the DLL into the `lib` folder expected
>>>>>>> by the MSVC scripts... which makes me wonder if I've chosen a
>>>>>>> harder way than necessary?
>>>>>>
>>>>>> It looks like pg_dump's meson.build is missing dependencies on
>>>>>> zstd (meson couldn't find the headers in the subproject without
>>>>>> them).
>>>>>
>>>>> I saw that this was added for LZ4, but I hadn't added it for zstd
>>>>> since I didn't run into an issue without it. Could you check that
>>>>> what I've added works for your case?
>>>>>
>>>>>>> Parallel zstd dumps seem to work as expected, in that the
>>>>>>> resulting pg_restore output is identical to uncompressed dumps
>>>>>>> and nothing explodes. I haven't inspected the threading
>>>>>>> implementation for safety yet, as you mentioned.
>>>>>>
>>>>>> Hm. Best I can tell, the CloneArchive() machinery is supposed to
>>>>>> be handling safety for this, by isolating each thread's state. I
>>>>>> don't feel comfortable pronouncing this new addition safe or not,
>>>>>> because I'm not sure I understand what the comments in the
>>>>>> format-specific _Clone() callbacks are saying yet.
>>>>>
>>>>> My line of reasoning for unix is that pg_dump forks before any
>>>>> calls to zstd. Nothing zstd does ought to affect the pg_dump
>>>>> layer. But that doesn't apply to pg_dump under windows. This is an
>>>>> open question. If there's no solid answer, I could disable/ignore
>>>>> the option (maybe only under windows).
>>>>
>>>> I may be missing something, but why would the patch affect this?
>>>> Why would it even affect safety of the parallel dump? And I don't
>>>> see any changes to the clone stuff ...
>>>
>>> zstd supports using threads during compression, with -Z
>>> zstd:workers=N. When unix forks, the child processes can't do
>>> anything to mess up the state of the parent processes.
>>>
>>> But windows pg_dump uses threads instead of forking, so it seems
>>> possible that the pg_dump -j threads that then spawn zstd threads
>>> could "leak threads" and break the main thread. I suspect there's no
>>> issue, but we still ought to verify that before declaring it safe.
>>
>> OK. I don't have access to a Windows machine so I can't test that. Is
>> it possible to disable the zstd threading, until we figure this out?
>
> I think that's what's best. I made it issue a warning if "workers" was
> specified. It could also be an error, or just ignored.
>
> I considered disabling workers only for windows, but realized that I
> haven't tested with threads myself - my local zstd package is compiled
> without threading, and I remember having some issue recompiling it
> with threading. Jacob's recipe for using meson wraps works well, but
> it still seems better to leave it as a future feature. I used that
> recipe to enable zstd with threading on CI (except for
> linux/autoconf).

+1 to disable this if we're unsure it works correctly. I agree it's
better to just error out if workers are requested - I rather dislike
when a tool just ignores an explicit parameter. And AFAICS it's what
zstd does too, when someone requests workers on an incompatible build.

FWIW I've been thinking about this a bit more and I don't quite see why
the threading would cause issues (except for Windows). I forgot
pg_basebackup already supports zstd, including the worker threading, so
why would it work there and not in pg_dump? Sure, pg_basebackup is not
parallel, but with separate pg_dump processes that shouldn't be an issue
(although I'm not sure when zstd creates threads).

The one thing I'm wondering about is at which point the worker threads
are spawned - but presumably not before the pg_dump processes fork.

I'll try building zstd with threading enabled, and do some tests over
the weekend.

> The function is first checking if it was passed a filename which
> already has a suffix. And if not, it searches through a list of
> suffixes, testing for an existing file with each suffix. The search
> with stat() doesn't happen if it has a suffix. I'm having trouble
> seeing how the hasSuffix() branch isn't dead code.

Another open question. AFAICS it's done this way because of this comment
in pg_backup_directory

 * ...
 * ".gz" suffix is added to the filenames. The TOC files are never
 * compressed by pg_dump, however they are accepted with the .gz suffix
 * too, in case the user has manually compressed them with 'gzip'.

I
Re: zstd compression for pg_dump
On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote: > On 3/27/23 19:28, Justin Pryzby wrote: > > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote: > >> On 3/16/23 05:50, Justin Pryzby wrote: > >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: > On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion > wrote: > > I did some smoke testing against zstd's GitHub release on Windows. To > > build against it, I had to construct an import library, and put that > > and the DLL into the `lib` folder expected by the MSVC scripts... > > which makes me wonder if I've chosen a harder way than necessary? > > It looks like pg_dump's meson.build is missing dependencies on zstd > (meson couldn't find the headers in the subproject without them). > >>> > >>> I saw that this was added for LZ4, but I hadn't added it for zstd since > >>> I didn't run into an issue without it. Could you check that what I've > >>> added works for your case ? > >>> > > Parallel zstd dumps seem to work as expected, in that the resulting > > pg_restore output is identical to uncompressed dumps and nothing > > explodes. I haven't inspected the threading implementation for safety > > yet, as you mentioned. > > Hm. Best I can tell, the CloneArchive() machinery is supposed to be > handling safety for this, by isolating each thread's state. I don't feel > comfortable pronouncing this new addition safe or not, because I'm not > sure I understand what the comments in the format-specific _Clone() > callbacks are saying yet. > >>> > >>> My line of reasoning for unix is that pg_dump forks before any calls to > >>> zstd. Nothing zstd does ought to affect the pg_dump layer. But that > >>> doesn't apply to pg_dump under windows. This is an opened question. If > >>> there's no solid answer, I could disable/ignore the option (maybe only > >>> under windows). > >> > >> I may be missing something, but why would the patch affect this? Why > >> would it even affect safety of the parallel dump? 
> >> And I don't see any changes to the clone stuff ...
> >
> > zstd supports using threads during compression, with -Z zstd:workers=N.
> > When unix forks, the child processes can't do anything to mess up the
> > state of the parent processes.
> >
> > But windows pg_dump uses threads instead of forking, so it seems
> > possible that the pg_dump -j threads that then spawn zstd threads could
> > "leak threads" and break the main thread. I suspect there's no issue,
> > but we still ought to verify that before declaring it safe.
>
> OK. I don't have access to a Windows machine so I can't test that. Is it
> possible to disable the zstd threading, until we figure this out?

I think that's what's best. I made it issue a warning if "workers" was
specified. It could also be an error, or just ignored.

I considered disabling workers only for windows, but realized that I
haven't tested with threads myself - my local zstd package is compiled
without threading, and I remember having some issue recompiling it with
threading. Jacob's recipe for using meson wraps works well, but it
still seems better to leave it as a future feature. I used that recipe
to enable zstd with threading on CI (except for linux/autoconf).

> >>> The function is first checking if it was passed a filename which already
> >>> has a suffix. And if not, it searches through a list of suffixes,
> >>> testing for an existing file with each suffix. The search with stat()
> >>> doesn't happen if it has a suffix. I'm having trouble seeing how the
> >>> hasSuffix() branch isn't dead code. Another open question.
> >>
> >> AFAICS it's done this way because of this comment in pg_backup_directory
> >>
> >> * ...
> >> * ".gz" suffix is added to the filenames. The TOC files are never
> >> * compressed by pg_dump, however they are accepted with the .gz suffix
> >> * too, in case the user has manually compressed them with 'gzip'.
> >> > >> I haven't tried, but I believe that if you manually compress the > >> directory, it may hit this branch. > > > > That would make sense, but when I tried, it didn't work like that. > > The filenames were all uncompressed names. Maybe it worked differently > > in an older release. Or maybe it changed during development of the > > parallel-directory-dump patch and it's actually dead code. > > Interesting. Would be good to find out. I wonder if a little bit of > git-log digging could tell us more. Anyway, until we confirm it's dead > code, we should probably do what .gz does and have the same check for > .lz4 and .zst files. I found that hasSuffix() and cfopen() originated in the refactored patch Heikki's sent here; there's no history beyond that. https://www.postgresql.org/message-id/4D3954C7.9060503%40enterprisedb.com The patch published there appends the .gz within cfopen(), and the caller writes into the TOC the filename without ".gz". It seems like
Re: zstd compression for pg_dump
On Wed, Mar 29, 2023 at 6:35 AM Justin Pryzby wrote:
> If you have a zstd library with thread support, you could test with
> -Z zstd:workers=3. But I think threads aren't enabled in the common
> libzstd packages. Jacob figured out how to compile libzstd easily using
> "meson wraps" - but I don't know the details.

From the source root,

$ mkdir subprojects
$ meson wrap install zstd

From then on, Meson was pretty automagical about it during the ninja
build. The subproject's settings are themselves inspectable and settable
via `meson configure`:

$ meson configure -Dzstd:=

--Jacob
Re: zstd compression for pg_dump
On Tue, Mar 28, 2023 at 02:03:49PM -0400, Kirk Wolak wrote: > On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra > wrote: > > On 3/27/23 19:28, Justin Pryzby wrote: > > > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote: > > >> On 3/16/23 05:50, Justin Pryzby wrote: > > >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: > > On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion > > wrote: > > > I did some smoke testing against zstd's GitHub release on Windows. To > > ... > > OK. I don't have access to a Windows machine so I can't test that. Is it > > possible to disable the zstd threading, until we figure this out? > > Thomas since I appear to be one of the few windows users (I use both), can I > help? > I can test pg_dump... for you, easy to do. I do about 5-10 pg_dumps a day > on windows while developing. It'd be great if you'd exercise this and other changes to pg_dump/restore. Tomas just pushed a bugfix, so be sure to "git pull" before testing, or else you might rediscover the bug. If you have a zstd library with thread support, you could test with -Z zstd:workers=3. But I think threads aren't enabled in the common libzstd packages. Jacob figured out how to compile libzstd easily using "meson wraps" - but I don't know the details. -- Justin
Re: zstd compression for pg_dump
On Wed, Mar 15, 2023 at 9:50 PM Justin Pryzby wrote: > On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: > > It looks like pg_dump's meson.build is missing dependencies on zstd > > (meson couldn't find the headers in the subproject without them). > > I saw that this was added for LZ4, but I hadn't added it for zstd since > I didn't run into an issue without it. Could you check that what I've > added works for your case ? I thought I replied to this, sorry -- your newest patchset builds correctly with subprojects, so the new dependency looks good to me. Thanks! > > Hm. Best I can tell, the CloneArchive() machinery is supposed to be > > handling safety for this, by isolating each thread's state. I don't feel > > comfortable pronouncing this new addition safe or not, because I'm not > > sure I understand what the comments in the format-specific _Clone() > > callbacks are saying yet. > > My line of reasoning for unix is that pg_dump forks before any calls to > zstd. Nothing zstd does ought to affect the pg_dump layer. But that > doesn't apply to pg_dump under windows. This is an opened question. If > there's no solid answer, I could disable/ignore the option (maybe only > under windows). To (maybe?) move this forward a bit, note that pg_backup_custom's _Clone() function makes sure that there is no active compressor state at the beginning of the new thread. pg_backup_directory's implementation has no such provision. And I don't think it can, because the parent thread might have concurrently set one up -- see the directory-specific implementation of _CloseArchive(). Perhaps we should just NULL out those fields after the copy, instead? To illustrate why I think this is tough to characterize: if I've read the code correctly, the _Clone() and CloneArchive() implementations are running concurrently with code that is actively modifying the ArchiveHandle and the lclContext. 
So safety is only ensured to the extent that we keep track of which fields threads are allowed to touch, and I don't have that mental model. --Jacob
Re: zstd compression for pg_dump
On 3/28/23 20:03, Kirk Wolak wrote: > On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra > mailto:tomas.von...@enterprisedb.com>> > wrote: > > On 3/27/23 19:28, Justin Pryzby wrote: > > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote: > >> On 3/16/23 05:50, Justin Pryzby wrote: > >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: > On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion > mailto:jchamp...@timescale.com>> wrote: > > I did some smoke testing against zstd's GitHub release on > Windows. To > ... > OK. I don't have access to a Windows machine so I can't test that. Is it > possible to disable the zstd threading, until we figure this out? > > Thomas since I appear to be one of the few windows users (I use both), > can I help? > I can test pg_dump... for you, easy to do. I do about 5-10 pg_dumps a > day on windows while developing. > Perhaps. But I'll leave the details up to Justin - it's his patch, and I'm not sure how to verify the threading is OK. I'd try applying this patch, build with --with-zstd and then run the pg_dump TAP tests, and perhaps do some manual tests. And perhaps do the same for --with-lz4 - there's a thread [1] suggesting we don't detect lz4 stuff on Windows, so the TAP tests do nothing. https://www.postgresql.org/message-id/zajl96n9zw84u...@msg.df7cb.de > Also, I have an AWS instance I created to build PG/Win with readline > back in November. > I could give you access to that... (you are not the only person who has > made this statement here). > I've made such instances available for other Open Source developers, to > support them. > > Obvi I would share connection credentials privately. > I'd rather leave the Windows stuff up to someone with more experience with that platform. I have plenty of other stuff on my plate atm. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: zstd compression for pg_dump
On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra wrote: > On 3/27/23 19:28, Justin Pryzby wrote: > > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote: > >> On 3/16/23 05:50, Justin Pryzby wrote: > >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: > On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion < > jchamp...@timescale.com> wrote: > > I did some smoke testing against zstd's GitHub release on Windows. To > ... > OK. I don't have access to a Windows machine so I can't test that. Is it > possible to disable the zstd threading, until we figure this out? > > Thomas since I appear to be one of the few windows users (I use both), can I help? I can test pg_dump... for you, easy to do. I do about 5-10 pg_dumps a day on windows while developing. Also, I have an AWS instance I created to build PG/Win with readline back in November. I could give you access to that... (you are not the only person who has made this statement here). I've made such instances available for other Open Source developers, to support them. Obvi I would share connection credentials privately. Regards, Kirk
Re: zstd compression for pg_dump
On 3/27/23 19:28, Justin Pryzby wrote: > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote: >> On 3/16/23 05:50, Justin Pryzby wrote: >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion wrote: > I did some smoke testing against zstd's GitHub release on Windows. To > build against it, I had to construct an import library, and put that > and the DLL into the `lib` folder expected by the MSVC scripts... > which makes me wonder if I've chosen a harder way than necessary? It looks like pg_dump's meson.build is missing dependencies on zstd (meson couldn't find the headers in the subproject without them). >>> >>> I saw that this was added for LZ4, but I hadn't added it for zstd since >>> I didn't run into an issue without it. Could you check that what I've >>> added works for your case ? >>> > Parallel zstd dumps seem to work as expected, in that the resulting > pg_restore output is identical to uncompressed dumps and nothing > explodes. I haven't inspected the threading implementation for safety > yet, as you mentioned. Hm. Best I can tell, the CloneArchive() machinery is supposed to be handling safety for this, by isolating each thread's state. I don't feel comfortable pronouncing this new addition safe or not, because I'm not sure I understand what the comments in the format-specific _Clone() callbacks are saying yet. >>> >>> My line of reasoning for unix is that pg_dump forks before any calls to >>> zstd. Nothing zstd does ought to affect the pg_dump layer. But that >>> doesn't apply to pg_dump under windows. This is an opened question. If >>> there's no solid answer, I could disable/ignore the option (maybe only >>> under windows). >> >> I may be missing something, but why would the patch affect this? Why >> would it even affect safety of the parallel dump? And I don't see any >> changes to the clone stuff ... > > zstd supports using threads during compression, with -Z zstd:workers=N. 
> When unix forks, the child processes can't do anything to mess up the > state of the parent processes. > > But windows pg_dump uses threads instead of forking, so it seems > possible that the pg_dump -j threads that then spawn zstd threads could > "leak threads" and break the main thread. I suspect there's no issue, > but we still ought to verify that before declaring it safe. > OK. I don't have access to a Windows machine so I can't test that. Is it possible to disable the zstd threading, until we figure this out? >>> The function is first checking if it was passed a filename which already >>> has a suffix. And if not, it searches through a list of suffixes, >>> testing for an existing file with each suffix. The search with stat() >>> doesn't happen if it has a suffix. I'm having trouble seeing how the >>> hasSuffix() branch isn't dead code. Another opened question. >> >> AFAICS it's done this way because of this comment in pg_backup_directory >> >> * ... >> * ".gz" suffix is added to the filenames. The TOC files are never >> * compressed by pg_dump, however they are accepted with the .gz suffix >> * too, in case the user has manually compressed them with 'gzip'. >> >> I haven't tried, but I believe that if you manually compress the >> directory, it may hit this branch. > > That would make sense, but when I tried, it didn't work like that. > The filenames were all uncompressed names. Maybe it worked differently > in an older release. Or maybe it changed during development of the > parallel-directory-dump patch and it's actually dead code. > Interesting. Would be good to find out. I wonder if a little bit of git-log digging could tell us more. Anyway, until we confirm it's dead code, we should probably do what .gz does and have the same check for .lz4 and .zst files. > This is rebased over the updated compression API. > > It seems like I misunderstood something you said before, so now I put > back "supports_compression()". > Thanks! 
I need to do a bit more testing and review, but it seems pretty much RFC to me, assuming we can figure out what to do about threading. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: zstd compression for pg_dump
On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote: > On 3/16/23 05:50, Justin Pryzby wrote: > > On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: > >> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion > >> wrote: > >>> I did some smoke testing against zstd's GitHub release on Windows. To > >>> build against it, I had to construct an import library, and put that > >>> and the DLL into the `lib` folder expected by the MSVC scripts... > >>> which makes me wonder if I've chosen a harder way than necessary? > >> > >> It looks like pg_dump's meson.build is missing dependencies on zstd > >> (meson couldn't find the headers in the subproject without them). > > > > I saw that this was added for LZ4, but I hadn't added it for zstd since > > I didn't run into an issue without it. Could you check that what I've > > added works for your case ? > > > >>> Parallel zstd dumps seem to work as expected, in that the resulting > >>> pg_restore output is identical to uncompressed dumps and nothing > >>> explodes. I haven't inspected the threading implementation for safety > >>> yet, as you mentioned. > >> > >> Hm. Best I can tell, the CloneArchive() machinery is supposed to be > >> handling safety for this, by isolating each thread's state. I don't feel > >> comfortable pronouncing this new addition safe or not, because I'm not > >> sure I understand what the comments in the format-specific _Clone() > >> callbacks are saying yet. > > > > My line of reasoning for unix is that pg_dump forks before any calls to > > zstd. Nothing zstd does ought to affect the pg_dump layer. But that > > doesn't apply to pg_dump under windows. This is an opened question. If > > there's no solid answer, I could disable/ignore the option (maybe only > > under windows). > > I may be missing something, but why would the patch affect this? Why > would it even affect safety of the parallel dump? And I don't see any > changes to the clone stuff ... 
zstd supports using threads during compression, with -Z zstd:workers=N.
When unix forks, the child processes can't do anything to mess up the
state of the parent processes.

But windows pg_dump uses threads instead of forking, so it seems
possible that the pg_dump -j threads that then spawn zstd threads could
"leak threads" and break the main thread. I suspect there's no issue,
but we still ought to verify that before declaring it safe.

> > The function is first checking if it was passed a filename which already
> > has a suffix. And if not, it searches through a list of suffixes,
> > testing for an existing file with each suffix. The search with stat()
> > doesn't happen if it has a suffix. I'm having trouble seeing how the
> > hasSuffix() branch isn't dead code. Another open question.
>
> AFAICS it's done this way because of this comment in pg_backup_directory
>
> * ...
> * ".gz" suffix is added to the filenames. The TOC files are never
> * compressed by pg_dump, however they are accepted with the .gz suffix
> * too, in case the user has manually compressed them with 'gzip'.
>
> I haven't tried, but I believe that if you manually compress the
> directory, it may hit this branch.

That would make sense, but when I tried, it didn't work like that.
The filenames were all uncompressed names. Maybe it worked differently
in an older release. Or maybe it changed during development of the
parallel-directory-dump patch and it's actually dead code.

This is rebased over the updated compression API.

It seems like I misunderstood something you said before, so now I put
back "supports_compression()".
-- Justin

From b4cf9df2f8672b012301d58ff83f2b0344de9e96 Mon Sep 17 00:00:00 2001
From: Justin Pryzby
Date: Sat, 7 Jan 2023 15:45:06 -0600
Subject: [PATCH 1/3] pg_dump: zstd compression

Previously proposed at: 20201221194924.gi30...@telsasoft.com
---
 doc/src/sgml/ref/pg_dump.sgml         |  13 +-
 src/bin/pg_dump/Makefile              |   2 +
 src/bin/pg_dump/compress_io.c         |  62 ++-
 src/bin/pg_dump/compress_zstd.c       | 535 ++
 src/bin/pg_dump/compress_zstd.h       |  25 ++
 src/bin/pg_dump/meson.build           |   4 +-
 src/bin/pg_dump/pg_backup_archiver.c  |   9 +-
 src/bin/pg_dump/pg_backup_directory.c |   2 +
 src/bin/pg_dump/pg_dump.c             |  16 +-
 src/bin/pg_dump/t/002_pg_dump.pl      |  79 +++-
 src/tools/pginclude/cpluspluscheck    |   1 +
 11 files changed, 694 insertions(+), 54 deletions(-)
 create mode 100644 src/bin/pg_dump/compress_zstd.c
 create mode 100644 src/bin/pg_dump/compress_zstd.h

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index d6b1faa8042..62b3ed2dad6 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -330,8 +330,9 @@ PostgreSQL documentation
 machine-readable format that pg_restore can read. A directory format
 archive can be manipulated with standard Unix tools; for example, files
 in an uncompressed
Re: zstd compression for pg_dump
Hi,

On 3/17/23 03:43, Tomas Vondra wrote:
>
> ...
>
>>> I'm a little suspicious of the replacement of supports_compression()
>>> with parse_compress_specification(). For example:
>>>
>>>> - errmsg = supports_compression(AH->compression_spec);
>>>> - if (errmsg)
>>>> + parse_compress_specification(AH->compression_spec.algorithm,
>>>> + NULL, &compress_spec);
>>>> + if (compress_spec.parse_error != NULL)
>>>> {
>>>> pg_log_warning("archive is compressed, but this installation does not support compression (%s)",
>>>> - errmsg);
>>>> - pg_free(errmsg);
>>>> + compress_spec.parse_error);
>>>> + pg_free(compress_spec.parse_error);
>>>> }
>>>
>>> The top-level error here is "does not support compression", but wouldn't
>>> a bad specification option with a supported compression method trip this
>>> path too?
>>
>> No - since the 2nd argument is passed as NULL, it just checks whether
>> the compression is supported. Maybe there ought to be a more
>> direct/clean way to do it. But up to now evidently nobody needed to do
>> that.
>
> I don't think the patch can use parse_compress_specification() instead
> of replacing supports_compression(). The parsing simply determines if the
> build has the library, it doesn't say if a particular tool was modified
> to support the algorithm. I might build --with-zstd and yet pg_dump does
> not support that algorithm yet.
>
> Even after we add zstd to pg_dump, it's quite likely other compression
> algorithms may not be supported by pg_dump from day 1.
>
> I haven't looked at / tested the patch yet, but I wonder if you have any
> thoughts regarding the size_t / int tweaks. I don't know what types zstd
> library uses, how it reports errors etc.

Any thoughts regarding my comments on removing supports_compression()?

Also, this patch needs a rebase to adapt it to the API changes from last
week. The sooner the better, considering we're getting fairly close to
the end of the CF and code freeze.

regards

-- Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: zstd compression for pg_dump
On 3/16/23 05:50, Justin Pryzby wrote: > On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote: >> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion >> wrote: >>> I did some smoke testing against zstd's GitHub release on Windows. To >>> build against it, I had to construct an import library, and put that >>> and the DLL into the `lib` folder expected by the MSVC scripts... >>> which makes me wonder if I've chosen a harder way than necessary? >> >> It looks like pg_dump's meson.build is missing dependencies on zstd >> (meson couldn't find the headers in the subproject without them). > > I saw that this was added for LZ4, but I hadn't added it for zstd since > I didn't run into an issue without it. Could you check that what I've > added works for your case ? > >>> Parallel zstd dumps seem to work as expected, in that the resulting >>> pg_restore output is identical to uncompressed dumps and nothing >>> explodes. I haven't inspected the threading implementation for safety >>> yet, as you mentioned. >> >> Hm. Best I can tell, the CloneArchive() machinery is supposed to be >> handling safety for this, by isolating each thread's state. I don't feel >> comfortable pronouncing this new addition safe or not, because I'm not >> sure I understand what the comments in the format-specific _Clone() >> callbacks are saying yet. > > My line of reasoning for unix is that pg_dump forks before any calls to > zstd. Nothing zstd does ought to affect the pg_dump layer. But that > doesn't apply to pg_dump under windows. This is an opened question. If > there's no solid answer, I could disable/ignore the option (maybe only > under windows). > I may be missing something, but why would the patch affect this? Why would it even affect safety of the parallel dump? And I don't see any changes to the clone stuff ... 
>> On to code (not a complete review):
>>
>>> if (hasSuffix(fname, ".gz"))
>>> compression_spec.algorithm = PG_COMPRESSION_GZIP;
>>> else
>>> {
>>> bool exists;
>>>
>>> exists = (stat(path, &st) == 0);
>>> /* avoid unused warning if it is not built with compression */
>>> if (exists)
>>> compression_spec.algorithm = PG_COMPRESSION_NONE;
>>> -#ifdef HAVE_LIBZ
>>> - if (!exists)
>>> - {
>>> - free_keep_errno(fname);
>>> - fname = psprintf("%s.gz", path);
>>> - exists = (stat(fname, &st) == 0);
>>> -
>>> - if (exists)
>>> - compression_spec.algorithm = PG_COMPRESSION_GZIP;
>>> - }
>>> -#endif
>>> -#ifdef USE_LZ4
>>> - if (!exists)
>>> - {
>>> - free_keep_errno(fname);
>>> - fname = psprintf("%s.lz4", path);
>>> - exists = (stat(fname, &st) == 0);
>>> -
>>> - if (exists)
>>> - compression_spec.algorithm = PG_COMPRESSION_LZ4;
>>> - }
>>> -#endif
>>> + else if (check_compressed_file(path, &fname, "gz"))
>>> + compression_spec.algorithm = PG_COMPRESSION_GZIP;
>>> + else if (check_compressed_file(path, &fname, "lz4"))
>>> + compression_spec.algorithm = PG_COMPRESSION_LZ4;
>>> + else if (check_compressed_file(path, &fname, "zst"))
>>> + compression_spec.algorithm = PG_COMPRESSION_ZSTD;
>>> }
>>
>> This function lost some coherence, I think. Should there be a hasSuffix
>> check at the top for ".zstd" (and, for that matter, ".lz4")?

This was discussed in the lz4 thread a couple of days ago, and I think
there should be hasSuffix() cases for lz4/zstd too, not just for .gz.

> The function is first checking if it was passed a filename which already
> has a suffix. And if not, it searches through a list of suffixes,
> testing for an existing file with each suffix. The search with stat()
> doesn't happen if it has a suffix. I'm having trouble seeing how the
> hasSuffix() branch isn't dead code. Another open question.

AFAICS it's done this way because of this comment in pg_backup_directory

 * ...
 * ".gz" suffix is added to the filenames. The TOC files are never
 * compressed by pg_dump, however they are accepted with the .gz suffix
 * too, in case the user has manually compressed them with 'gzip'.

I haven't tried, but I believe that if you manually compress the
directory, it may hit this branch. And IMO if we support that for gzip,
the other compression methods should do that too for consistency.

In any case, it's a tiny amount of code and I don't feel like ripping
that out when it might break some currently supported use case.

>> I'm a little suspicious of the replacement of supports_compression()
>> with parse_compress_specification(). For example:
>>
>>> - errmsg = supports_compression(AH->compression_spec);
>>> - if (errmsg)
>>> + parse_compress_specification(AH->compression_spec.algorithm,
>>> + NULL, &compress_spec);
>>> + if (compress_spec.parse_error != NULL)
>>> {
>>> pg_log_warning("archive is compressed, but this installation does
>>> not
Re: zstd compression for pg_dump
On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion
> wrote:
> > I did some smoke testing against zstd's GitHub release on Windows. To
> > build against it, I had to construct an import library, and put that
> > and the DLL into the `lib` folder expected by the MSVC scripts...
> > which makes me wonder if I've chosen a harder way than necessary?
>
> It looks like pg_dump's meson.build is missing dependencies on zstd
> (meson couldn't find the headers in the subproject without them).

I saw that this was added for LZ4, but I hadn't added it for zstd since
I didn't run into an issue without it. Could you check that what I've
added works for your case?

> > Parallel zstd dumps seem to work as expected, in that the resulting
> > pg_restore output is identical to uncompressed dumps and nothing
> > explodes. I haven't inspected the threading implementation for safety
> > yet, as you mentioned.
>
> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> handling safety for this, by isolating each thread's state. I don't feel
> comfortable pronouncing this new addition safe or not, because I'm not
> sure I understand what the comments in the format-specific _Clone()
> callbacks are saying yet.

My line of reasoning for unix is that pg_dump forks before any calls to
zstd. Nothing zstd does ought to affect the pg_dump layer. But that
doesn't apply to pg_dump under windows. This is an open question. If
there's no solid answer, I could disable/ignore the option (maybe only
under windows).
> On to code (not a complete review):
>
> > if (hasSuffix(fname, ".gz"))
> > compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > else
> > {
> > bool exists;
> >
> > exists = (stat(path, &st) == 0);
> > /* avoid unused warning if it is not built with compression */
> > if (exists)
> > compression_spec.algorithm = PG_COMPRESSION_NONE;
> > -#ifdef HAVE_LIBZ
> > - if (!exists)
> > - {
> > - free_keep_errno(fname);
> > - fname = psprintf("%s.gz", path);
> > - exists = (stat(fname, &st) == 0);
> > -
> > - if (exists)
> > - compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > - }
> > -#endif
> > -#ifdef USE_LZ4
> > - if (!exists)
> > - {
> > - free_keep_errno(fname);
> > - fname = psprintf("%s.lz4", path);
> > - exists = (stat(fname, &st) == 0);
> > -
> > - if (exists)
> > - compression_spec.algorithm = PG_COMPRESSION_LZ4;
> > - }
> > -#endif
> > + else if (check_compressed_file(path, &fname, "gz"))
> > + compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > + else if (check_compressed_file(path, &fname, "lz4"))
> > + compression_spec.algorithm = PG_COMPRESSION_LZ4;
> > + else if (check_compressed_file(path, &fname, "zst"))
> > + compression_spec.algorithm = PG_COMPRESSION_ZSTD;
> > }
>
> This function lost some coherence, I think. Should there be a hasSuffix
> check at the top for ".zstd" (and, for that matter, ".lz4")?

The function is first checking if it was passed a filename which already
has a suffix. And if not, it searches through a list of suffixes,
testing for an existing file with each suffix. The search with stat()
doesn't happen if it has a suffix. I'm having trouble seeing how the
hasSuffix() branch isn't dead code. Another open question.

> I'm a little suspicious of the replacement of supports_compression()
> with parse_compress_specification(). For example:
>
> > - errmsg = supports_compression(AH->compression_spec);
> > - if (errmsg)
> > + parse_compress_specification(AH->compression_spec.algorithm,
> > + NULL, &compress_spec);
> > + if (compress_spec.parse_error != NULL)
> > {
> > pg_log_warning("archive is compressed, but this installation does
> > not support compression (%s)",
> > - errmsg);
> > - pg_free(errmsg);
> > + compress_spec.parse_error);
> > + pg_free(compress_spec.parse_error);
> > }
>
> The top-level error here is "does not support compression", but wouldn't
> a bad specification option with a supported compression method trip this
> path too?

No - since the 2nd argument is passed as NULL, it just checks whether
the compression is supported. Maybe there ought to be a more
direct/clean way to do it. But up to now evidently nobody needed to do
that.

> > +static void
> > +ZSTD_CCtx_setParam_or_die(ZSTD_CStream *cstream,
> > + ZSTD_cParameter param, int value, char *paramname)
>
> IMO we should avoid stepping on the ZSTD_ namespace with our own
> internal function names.

done

> > + if (cs->readF != NULL)
> > +
> > + if (cs->writeF != NULL)
>
> This seems to suggest that both cs->readF and cs->writeF could be set,
> but in that case, the output buffer gets
Re: zstd compression for pg_dump
Hi, This'll need another rebase over the meson ICU changes. On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion wrote: > I did some smoke testing against zstd's GitHub release on Windows. To > build against it, I had to construct an import library, and put that > and the DLL into the `lib` folder expected by the MSVC scripts... > which makes me wonder if I've chosen a harder way than necessary? A meson wrap made this much easier! It looks like pg_dump's meson.build is missing dependencies on zstd (meson couldn't find the headers in the subproject without them). > Parallel zstd dumps seem to work as expected, in that the resulting > pg_restore output is identical to uncompressed dumps and nothing > explodes. I haven't inspected the threading implementation for safety > yet, as you mentioned. Hm. Best I can tell, the CloneArchive() machinery is supposed to be handling safety for this, by isolating each thread's state. I don't feel comfortable pronouncing this new addition safe or not, because I'm not sure I understand what the comments in the format-specific _Clone() callbacks are saying yet. > And I still wasn't able to test :workers, since > it looks like the official libzstd for Windows isn't built for > multithreading. That'll be another day's project. The wrapped installation enabled threading too, so I was able to try with :workers=8. Everything seems to work, but I didn't have a dataset that showed speed improvements at the time. It did seem to affect the overall compressibility negatively -- which makes sense, I think, assuming each thread is looking at a separate and/or smaller window. 
On to code (not a complete review): > if (hasSuffix(fname, ".gz")) > compression_spec.algorithm = PG_COMPRESSION_GZIP; > else > { > bool exists; > > exists = (stat(path, &st) == 0); > /* avoid unused warning if it is not built with compression */ > if (exists) > compression_spec.algorithm = PG_COMPRESSION_NONE; > -#ifdef HAVE_LIBZ > - if (!exists) > - { > - free_keep_errno(fname); > - fname = psprintf("%s.gz", path); > - exists = (stat(fname, &st) == 0); > - > - if (exists) > - compression_spec.algorithm = PG_COMPRESSION_GZIP; > - } > -#endif > -#ifdef USE_LZ4 > - if (!exists) > - { > - free_keep_errno(fname); > - fname = psprintf("%s.lz4", path); > - exists = (stat(fname, &st) == 0); > - > - if (exists) > - compression_spec.algorithm = PG_COMPRESSION_LZ4; > - } > -#endif > + else if (check_compressed_file(path, &fname, "gz")) > + compression_spec.algorithm = PG_COMPRESSION_GZIP; > + else if (check_compressed_file(path, &fname, "lz4")) > + compression_spec.algorithm = PG_COMPRESSION_LZ4; > + else if (check_compressed_file(path, &fname, "zst")) > + compression_spec.algorithm = PG_COMPRESSION_ZSTD; > } This function lost some coherence, I think. Should there be a hasSuffix check at the top for ".zstd" (and, for that matter, ".lz4")? And the comment references an unused warning, which is only possible with the #ifdef blocks that were removed. I'm a little suspicious of the replacement of supports_compression() with parse_compress_specification().
For example: > - errmsg = supports_compression(AH->compression_spec); > - if (errmsg) > + parse_compress_specification(AH->compression_spec.algorithm, > + NULL, &compress_spec); > + if (compress_spec.parse_error != NULL) > { > pg_log_warning("archive is compressed, but this installation does not > support compression (%s > - errmsg); > - pg_free(errmsg); > + compress_spec.parse_error); > + pg_free(compress_spec.parse_error); > } The top-level error here is "does not support compression", but wouldn't a bad specification option with a supported compression method trip this path too? > +static void > +ZSTD_CCtx_setParam_or_die(ZSTD_CStream *cstream, > + ZSTD_cParameter param, int value, char *paramname) IMO we should avoid stepping on the ZSTD_ namespace with our own internal function names. > + if (cs->readF != NULL) > + { > + zstdcs->dstream = ZSTD_createDStream(); > + if (zstdcs->dstream == NULL) > + pg_fatal("could not initialize compression library"); > + > + zstdcs->input.size = ZSTD_DStreamInSize(); > + zstdcs->input.src = pg_malloc(zstdcs->input.size); > + > + zstdcs->output.size = ZSTD_DStreamOutSize(); > + zstdcs->output.dst = pg_malloc(zstdcs->output.size + 1); > + } > + > + if (cs->writeF != NULL) > + { > + zstdcs->cstream = ZstdCStreamParams(cs->compression_spec); > + > + zstdcs->output.size = ZSTD_CStreamOutSize(); > + zstdcs->output.dst = pg_malloc(zstdcs->output.size); > + zstdcs->output.pos = 0; > + } This seems to suggest that
Re: zstd compression for pg_dump
On Sat, Mar 4, 2023 at 8:57 AM Justin Pryzby wrote: > pryzbyj=# CREATE TABLE t1 AS SELECT i,array_agg(j) FROM > generate_series(1,444)i,generate_series(1,9)j GROUP BY 1; > $ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=1 |wc -c > 82023 > $ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=0 |wc -c > 1048267 Nice! I did some smoke testing against zstd's GitHub release on Windows. To build against it, I had to construct an import library, and put that and the DLL into the `lib` folder expected by the MSVC scripts... which makes me wonder if I've chosen a harder way than necessary? Parallel zstd dumps seem to work as expected, in that the resulting pg_restore output is identical to uncompressed dumps and nothing explodes. I haven't inspected the threading implementation for safety yet, as you mentioned. And I still wasn't able to test :workers, since it looks like the official libzstd for Windows isn't built for multithreading. That'll be another day's project. --Jacob
Re: zstd compression for pg_dump
On Sat, Feb 25, 2023 at 07:22:27PM -0600, Justin Pryzby wrote: > On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote: > > This is a draft patch - review is welcome and would help to get this > > ready to be considered for v16, if desired. > > > > I'm going to add this thread to the old CF entry. > > https://commitfest.postgresql.org/31/2888/ > > This resolves cfbot warnings: windows and cppcheck. > And refactors zstd routines. > And updates docs. > And includes some fixes for earlier patches that these patches conflicts > with/depends on. This rebases over the TAP and doc fixes to LZ4. And adds necessary ENV to makefile and meson. And adds an annoying boilerplate header. And removes supports_compression(), which is what I think Tomas meant when referring to "annoying unsupported cases". And updates zstd.c: fix an off-by-one, allocate in init depending on readF/writeF, do not reset the input buffer on each iteration, and show parameter name in errors. I'd appreciate help checking that this is doing the right things and works correctly with zstd threaded workers. The zstd API says: "use one different context per thread for parallel execution" and "For parallel execution, use one separate ZSTD_CStream per thread". https://github.com/facebook/zstd/blob/dev/lib/zstd.h I understand that to mean that, if pg_dump *itself* were using threads, then each thread would need to call ZSTD_createCStream(). pg_dump isn't threaded, so there's nothing special needed, right? Except that, under windows, pg_dump -Fd -j actually uses threads instead of forking. I *think* that's still safe, since the pgdump threads are created *before* calling zstd functions (see _PrintTocData and _StartData of the custom and directory formats), so it happens naturally that there's a separate zstd stream for each thread of pgdump.
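The per-thread-stream reasoning above can be sketched as follows, with Python threads and zlib standing in for pg_dump's worker threads and libzstd (an analogy only — the point is state isolation, not the codec: each worker constructs its own compression context inside the thread, so no stream state is ever shared):

```python
import threading
import zlib

# zlib stands in for libzstd here: the point is state isolation, not the
# codec.  Each worker creates its own compressor *inside* the thread,
# mirroring "one separate compression stream per parallel worker".
results = {}

def worker(ident, payload):
    ctx = zlib.compressobj(6)            # per-thread context, never shared
    results[ident] = ctx.compress(payload) + ctx.flush()

payloads = {i: bytes([i]) * 50_000 for i in range(4)}
threads = [threading.Thread(target=worker, args=(i, p))
           for i, p in payloads.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every stream round-trips to its own input: no cross-thread corruption.
for i, payload in payloads.items():
    assert zlib.decompress(results[i]) == payload
print("4 independent streams OK")
```

If the workers instead shared one compressobj, the interleaved compress() calls would mix the streams; creating the context after the thread starts is what makes the pattern safe.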
-- Justin >From 6a8b88a3dd37d24ebfdaa6a96505476b8a1efe92 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sat, 7 Jan 2023 15:45:06 -0600 Subject: [PATCH 1/3] pg_dump: zstd compression Previously proposed at: 20201221194924.gi30...@telsasoft.com --- doc/src/sgml/ref/pg_dump.sgml | 8 +- src/bin/pg_dump/Makefile | 2 + src/bin/pg_dump/compress_io.c | 79 ++-- src/bin/pg_dump/compress_zstd.c | 515 ++ src/bin/pg_dump/compress_zstd.h | 9 + src/bin/pg_dump/meson.build | 2 + src/bin/pg_dump/pg_backup_archiver.c | 28 +- src/bin/pg_dump/pg_backup_directory.c | 2 + src/bin/pg_dump/pg_dump.c | 13 - src/bin/pg_dump/t/002_pg_dump.pl | 79 +++- src/tools/pginclude/cpluspluscheck| 1 + 11 files changed, 654 insertions(+), 84 deletions(-) create mode 100644 src/bin/pg_dump/compress_zstd.c create mode 100644 src/bin/pg_dump/compress_zstd.h diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index 334e4b7fd14..1fb66be1818 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -330,8 +330,9 @@ PostgreSQL documentation machine-readable format that pg_restore can read. A directory format archive can be manipulated with standard Unix tools; for example, files in an uncompressed archive - can be compressed with the gzip or - lz4 tools. + can be compressed with the gzip, + lz4, or + zstd tools. This format is compressed by default using gzip and also supports parallel dumps. @@ -655,7 +656,8 @@ PostgreSQL documentation Specify the compression method and/or the compression level to use. The compression method can be set to gzip, -lz4, or none for no compression. +lz4, zstd, +or none for no compression. A compression detail string can optionally be specified. If the detail string is an integer, it specifies the compression level. 
Otherwise, it should be a comma-separated list of items, each of the diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile index eb8f59459a1..24de7593a6a 100644 --- a/src/bin/pg_dump/Makefile +++ b/src/bin/pg_dump/Makefile @@ -18,6 +18,7 @@ include $(top_builddir)/src/Makefile.global export GZIP_PROGRAM=$(GZIP) export LZ4 +export ZSTD export with_icu override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS) @@ -29,6 +30,7 @@ OBJS = \ compress_io.o \ compress_lz4.o \ compress_none.o \ + compress_zstd.o \ dumputils.o \ parallel.o \ pg_backup_archiver.o \ diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c index ce06f1eac9c..a3c2f36bb67 100644 --- a/src/bin/pg_dump/compress_io.c +++ b/src/bin/pg_dump/compress_io.c @@ -52,8 +52,8 @@ * * InitDiscoverCompressFileHandle tries to infer the compression by the * filename suffix. If the suffix is not yet known then it tries to simply - * open the file and if it fails, it tries to open the same file with the .gz - * suffix, and then
Re: zstd compression for pg_dump
On Fri, Mar 03, 2023 at 01:38:05PM -0800, Jacob Champion wrote: > > > With this particular dataset, I don't see much improvement with > > > zstd:long. > > > > Yeah. I think this could be because either 1) you already got very good > > compression without looking at more data; and/or 2) the neighboring data > > is already very similar, maybe equally or more similar than the further > > data, from which there's nothing to gain. > > What kinds of improvements do you see with your setup? I'm wondering > when we would suggest that people use it. On customer data, I see small improvements - below 10%. But on my first two tries, I made synthetic data sets where it's a lot: $ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long |wc -c 286107 $ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long=0 |wc -c 1709695 That's just 6 identical tables like: pryzbyj=# CREATE TABLE t1 AS SELECT generate_series(1,99); In this case, "custom" format doesn't see that benefit, because the greatest similarity is across tables, which don't share compressor state. But I think the note that I wrote in the docs about that should be removed - custom format could see a big benefit, as long as the table is big enough, and there's more similarity/repetition at longer distances. Here's one where custom format *does* benefit, due to long-distance repetition within a single table. The data is contrived, but the schema of ID => data is not. What's notable isn't how compressible the data is, but how much *more* compressible it is with long-distance matching. pryzbyj=# CREATE TABLE t1 AS SELECT i,array_agg(j) FROM generate_series(1,444)i,generate_series(1,9)j GROUP BY 1; $ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=1 |wc -c 82023 $ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=0 |wc -c 1048267 -- Justin
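The effect shown above — repetition that lies beyond the match window is invisible to the compressor — can be reproduced with any windowed LZ codec. In this sketch zlib's wbits plays the role of zstd's windowLog (an analogy, not pg_dump code):

```python
import os
import zlib

# zlib's wbits stands in for zstd's windowLog: with raw deflate,
# wbits=-9 gives a 512-byte match window and wbits=-15 a 32 KiB window.
block = os.urandom(4096)                 # incompressible on its own
data = block + bytes(20_000) + block     # the repeat lies ~24 KiB back

def deflate(payload, wbits):
    c = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return c.compress(payload) + c.flush()

small = deflate(data, -9)                # repeat is outside the 512 B window
large = deflate(data, -15)               # repeat is inside the 32 KiB window
print(len(small), len(large))
# The second copy of `block` collapses into back-references only when the
# window is large enough to still contain the first copy.
assert len(large) < len(small) - 3000
```

This is the same shape as the pg_dump numbers above: the data is not especially compressible, but it is far more compressible once the window reaches back to the earlier copy.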
Re: zstd compression for pg_dump
On Fri, Mar 3, 2023 at 10:55 AM Justin Pryzby wrote: > Thanks for looking. If your zstd library is compiled with thread > support, could you also try with :workers=N ? I believe this is working > correctly, but I'm going to ask for help verifying that... Unfortunately not (Ubuntu 20.04): pg_dump: error: could not set compression parameter: Unsupported parameter But that lets me review the error! I think these error messages should say which options caused them. > It'd be especially useful to test under windows, where pgdump/restore > use threads instead of forking... If you have a windows environment but > not set up for development, I think it's possible to get cirrusci to > compile a patch for you and then retrieve the binaries provided as an > "artifact" (credit/blame for this idea should be directed to Thomas > Munro). I should be able to do that next week. > > With this particular dataset, I don't see much improvement with > > zstd:long. > > Yeah. I this could be because either 1) you already got very good > comprssion without looking at more data; and/or 2) the neighboring data > is already very similar, maybe equally or more similar, than the further > data, from which there's nothing to gain. What kinds of improvements do you see with your setup? I'm wondering when we would suggest that people use it. > I don't want to start exposing lots of fine-granined parameters at this > point. In the immediate case, it looks like it may require more than > just adding another parameter: > > Note: If windowLog is set to larger than 27, > --long=windowLog or --memory=windowSize needs to be passed to the > decompressor. Hm. That would complicate things. Thanks, --Jacob
Re: zstd compression for pg_dump
On Fri, Mar 03, 2023 at 10:32:53AM -0800, Jacob Champion wrote: > On Sat, Feb 25, 2023 at 5:22 PM Justin Pryzby wrote: > > This resolves cfbot warnings: windows and cppcheck. > > And refactors zstd routines. > > And updates docs. > > And includes some fixes for earlier patches that these patches conflicts > > with/depends on. > > This'll need a rebase (cfbot took a while to catch up). Soon. > The patchset includes basebackup modifications, which are part of a > different CF entry; was that intended? Yes, it's intentional - if zstd:long mode were to be merged first, then this patch should include long mode from the start. Or, if pgdump+zstd were merged first, then long mode could be added to both places. > I tried this on a local, 3.5GB, mostly-text table (from the UK Price Thanks for looking. If your zstd library is compiled with thread support, could you also try with :workers=N ? I believe this is working correctly, but I'm going to ask for help verifying that... It'd be especially useful to test under windows, where pgdump/restore use threads instead of forking... If you have a windows environment but not set up for development, I think it's possible to get cirrusci to compile a patch for you and then retrieve the binaries provided as an "artifact" (credit/blame for this idea should be directed to Thomas Munro). > With this particular dataset, I don't see much improvement with > zstd:long. Yeah. I think this could be because either 1) you already got very good compression without looking at more data; and/or 2) the neighboring data is already very similar, maybe equally or more similar than the further data, from which there's nothing to gain. > (At nearly double the CPU time, I get a <1% improvement in > compression size.) I assume it's heavily data dependent, but from the notes on --long [2] it seems like they expect you to play around with the window size to further tailor it to your data.
> Does it make sense > to provide the long option without the windowLog parameter? I don't want to start exposing lots of fine-grained parameters at this point. In the immediate case, it looks like it may require more than just adding another parameter: Note: If windowLog is set to larger than 27, --long=windowLog or --memory=windowSize needs to be passed to the decompressor. -- Justin
Re: zstd compression for pg_dump
On Sat, Feb 25, 2023 at 5:22 PM Justin Pryzby wrote: > This resolves cfbot warnings: windows and cppcheck. > And refactors zstd routines. > And updates docs. > And includes some fixes for earlier patches that these patches conflicts > with/depends on. This'll need a rebase (cfbot took a while to catch up). The patchset includes basebackup modifications, which are part of a different CF entry; was that intended? I tried this on a local, 3.5GB, mostly-text table (from the UK Price Paid dataset [1]) and the comparison against the other methods was impressive. (I'm no good at constructing compression benchmarks, so this is a super naive setup. Client's on the same laptop as the server.) $ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z zstd > /tmp/zstd.dump real 1m17.632s user 0m35.521s sys 0m2.683s $ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z lz4 > /tmp/lz4.dump real 1m13.125s user 0m19.795s sys 0m3.370s $ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z gzip > /tmp/gzip.dump real 2m24.523s user 2m22.114s sys 0m1.848s $ ls -l /tmp/*.dump -rw-rw-r-- 1 jacob jacob 1331493925 Mar 3 09:45 /tmp/gzip.dump -rw-rw-r-- 1 jacob jacob 2125998939 Mar 3 09:42 /tmp/lz4.dump -rw-rw-r-- 1 jacob jacob 1215834718 Mar 3 09:40 /tmp/zstd.dump Default gzip was the only method that bottlenecked on pg_dump rather than the server, and default zstd outcompressed it at a fraction of the CPU time. So, naively, this looks really good. With this particular dataset, I don't see much improvement with zstd:long. (At nearly double the CPU time, I get a <1% improvement in compression size.) I assume it's heavily data dependent, but from the notes on --long [2] it seems like they expect you to play around with the window size to further tailor it to your data. Does it make sense to provide the long option without the windowLog parameter? Thanks, --Jacob [1] https://landregistry.data.gov.uk/ [2] https://github.com/facebook/zstd/releases/tag/v1.3.2
Re: zstd compression for pg_dump
On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote: > This is a draft patch - review is welcome and would help to get this > ready to be considererd for v16, if desired. > > I'm going to add this thread to the old CF entry. > https://commitfest.postgresql.org/31/2888/ This resolves cfbot warnings: windows and cppcheck. And refactors zstd routines. And updates docs. And includes some fixes for earlier patches that these patches conflicts with/depends on. >From ea9b67d09fe1b51e7946cf34fca5795e57dd3858 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Fri, 24 Feb 2023 14:07:09 -0600 Subject: [PATCH 1/4] f!fixes for LZ4 --- doc/src/sgml/ref/pg_dump.sgml| 4 ++-- src/bin/pg_dump/compress_gzip.c | 8 src/bin/pg_dump/compress_io.c| 2 +- src/bin/pg_dump/compress_io.h| 4 ++-- src/bin/pg_dump/compress_lz4.c | 4 ++-- src/bin/pg_dump/pg_backup_archiver.c | 4 ++-- src/bin/pg_dump/pg_dump.c| 2 -- src/bin/pg_dump/t/002_pg_dump.pl | 6 +++--- 8 files changed, 12 insertions(+), 22 deletions(-) diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index 49d218905fb..6fbe49f7ede 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -331,7 +331,7 @@ PostgreSQL documentation can read. A directory format archive can be manipulated with standard Unix tools; for example, files in an uncompressed archive can be compressed with the gzip or - lz4tool. + lz4 tools. This format is compressed by default using gzip and also supports parallel dumps. @@ -655,7 +655,7 @@ PostgreSQL documentation Specify the compression method and/or the compression level to use. The compression method can be set to gzip or -lz4 or none for no compression. +lz4, or none for no compression. A compression detail string can optionally be specified. If the detail string is an integer, it specifies the compression level. 
Otherwise, it should be a comma-separated list of items, each of the diff --git a/src/bin/pg_dump/compress_gzip.c b/src/bin/pg_dump/compress_gzip.c index 0af65afeb4e..52f41c2e58c 100644 --- a/src/bin/pg_dump/compress_gzip.c +++ b/src/bin/pg_dump/compress_gzip.c @@ -123,17 +123,9 @@ WriteDataToArchiveGzip(ArchiveHandle *AH, CompressorState *cs, gzipcs->outbuf = pg_malloc(ZLIB_OUT_SIZE + 1); gzipcs->outsize = ZLIB_OUT_SIZE; - /* - * A level of zero simply copies the input one block at the time. This - * is probably not what the user wanted when calling this interface. - */ - if (cs->compression_spec.level == 0) - pg_fatal("requested to compress the archive yet no level was specified"); - if (deflateInit(zp, cs->compression_spec.level) != Z_OK) pg_fatal("could not initialize compression library: %s", zp->msg); - /* Just be paranoid - maybe End is called after Start, with no Write */ zp->next_out = gzipcs->outbuf; zp->avail_out = gzipcs->outsize; } diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c index ce06f1eac9c..9239dbb2755 100644 --- a/src/bin/pg_dump/compress_io.c +++ b/src/bin/pg_dump/compress_io.c @@ -83,7 +83,7 @@ * used by the caller in an error message. 
*/ char * -supports_compression(const pg_compress_specification compression_spec) +pgdump_supports_compression(const pg_compress_specification compression_spec) { const pg_compress_algorithm algorithm = compression_spec.algorithm; bool supported = false; diff --git a/src/bin/pg_dump/compress_io.h b/src/bin/pg_dump/compress_io.h index bbde2693915..46815fa2ebe 100644 --- a/src/bin/pg_dump/compress_io.h +++ b/src/bin/pg_dump/compress_io.h @@ -21,7 +21,7 @@ #define ZLIB_OUT_SIZE 4096 #define ZLIB_IN_SIZE 4096 -extern char *supports_compression(const pg_compress_specification compression_spec); +extern char *pgdump_supports_compression(const pg_compress_specification compression_spec); /* * Prototype for callback function used in writeData() @@ -172,7 +172,7 @@ struct CompressFileHandle extern CompressFileHandle *InitCompressFileHandle(const pg_compress_specification compression_spec); /* - * Initialize a compress file stream. Deffer the compression algorithm + * Initialize a compress file stream. Infer the compression algorithm * from 'path', either by examining its suffix or by appending the supported * suffixes in 'path'. */ diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c index fe1014e6e77..63e794cdc68 100644 --- a/src/bin/pg_dump/compress_lz4.c +++ b/src/bin/pg_dump/compress_lz4.c @@ -161,8 +161,8 @@ typedef struct LZ4File } LZ4File; /* - * LZ4 equivalent to feof() or gzeof(). The end of file is reached if there - * is no decompressed output in the overflow buffer and the end of the file + * LZ4 equivalent to feof() or gzeof(). Return true iff there is no + * decompressed
Re: zstd compression for pg_dump
On Sat, Feb 25, 2023, at 7:31 AM, Tomas Vondra wrote: > On 2/24/23 20:18, Justin Pryzby wrote: > > This is a draft patch - review is welcome and would help to get this > > ready to be considererd for v16, if desired. > > > > I'm going to add this thread to the old CF entry. > > https://commitfest.postgresql.org/31/2888/ > > > > Thanks. Sadly cfbot is unhappy - the windows and cplusplus builds failed > because of some issue in pg_backup_archiver.h. But it's a bit bizarre > because the patch does not modify that file at all ... cpluspluscheck says # pg_dump is not C++-clean because it uses "public" and "namespace" # as field names, which is unfortunate but we won't change it now. Hence, the patch should exclude the new header file from it. --- a/src/tools/pginclude/cpluspluscheck +++ b/src/tools/pginclude/cpluspluscheck @@ -153,6 +153,7 @@ do test "$f" = src/bin/pg_dump/compress_gzip.h && continue test "$f" = src/bin/pg_dump/compress_io.h && continue test "$f" = src/bin/pg_dump/compress_lz4.h && continue + test "$f" = src/bin/pg_dump/compress_zstd.h && continue test "$f" = src/bin/pg_dump/compress_none.h && continue test "$f" = src/bin/pg_dump/parallel.h && continue test "$f" = src/bin/pg_dump/pg_backup_archiver.h && continue -- Euler Taveira EDB https://www.enterprisedb.com/
Re: zstd compression for pg_dump
On 2/24/23 20:18, Justin Pryzby wrote: > This is a draft patch - review is welcome and would help to get this > ready to be considered for v16, if desired. > > I'm going to add this thread to the old CF entry. > https://commitfest.postgresql.org/31/2888/ > Thanks. Sadly cfbot is unhappy - the windows and cplusplus builds failed because of some issue in pg_backup_archiver.h. But it's a bit bizarre because the patch does not modify that file at all ... regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: zstd compression for pg_dump
On Sat, Feb 25, 2023 at 01:44:36PM +0900, Michael Paquier wrote: > On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote: > > This is a draft patch - review is welcome and would help to get this > > ready to be considered for v16, if desired. > > > > I'm going to add this thread to the old CF entry. > > https://commitfest.postgresql.org/31/2888/ > > Patch 0003 adds support for the --long option of zstd, meaning that it > "enables long distance matching with #windowLog". What's the benefit > of that when it is applied to dumps and base backup contents? It can make them smaller. + The long keyword enables long-distance matching + mode, for improved compression ratio, at the expense of higher memory + use. Long-distance mode is supported only for +With zstd compression, long mode may allow dumps +to be significantly smaller, but it might not reduce the size of +custom or directory format dumps, whose fields are separately compressed. Note that I included that here as 003, but I also have a pre-existing patch for adding that just to basebackup. -- Justin
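The doc note above rests on a general property: members compressed with separate compressor states cannot exploit similarity between each other, no matter how large the window. A minimal illustration, with zlib standing in for zstd (analogy only, not pg_dump code):

```python
import os
import zlib

# Two identical "table members": compressed as one stream they share
# history; compressed independently they cannot reference each other.
# zlib stands in for zstd; illustration only.
table = os.urandom(8192)                 # one member's (incompressible) data

together = zlib.compress(table + table, 9)               # shared state
separate = zlib.compress(table, 9) + zlib.compress(table, 9)

print(len(together), len(separate))
# The second copy compresses to almost nothing only in the shared stream.
assert len(together) < len(separate) - 4000
```

This is why long mode can shrink a plain-format dump dramatically while leaving a custom or directory dump, whose members start fresh compressor state, nearly unchanged.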
Re: zstd compression for pg_dump
On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote: > This is a draft patch - review is welcome and would help to get this > ready to be considered for v16, if desired. > > I'm going to add this thread to the old CF entry. > https://commitfest.postgresql.org/31/2888/ Patch 0003 adds support for the --long option of zstd, meaning that it "enables long distance matching with #windowLog". What's the benefit of that when it is applied to dumps and base backup contents? -- Michael signature.asc Description: PGP signature
Re: zstd compression for pg_dump
On 1/3/21 9:53 PM, Justin Pryzby wrote: I rebased so the "typedef struct compression" patch is first and zstd on top of that (say, in case someone wants to bikeshed about which compression algorithm to support). And made a central struct with all the compression-specific info to further isolate the compress-specific changes. It has been a few months since there was a new patch and the current one no longer applies, so marking Returned with Feedback. Please resubmit to the next CF when you have a new patch. Regards, -- -David da...@pgmasters.net
Re: zstd compression for pg_dump
On 1/4/21 11:17 AM, Daniil Zakhlystov wrote: > Hi! > >> On Jan 4, 2021, at 11:04 AM, Andrey Borodin wrote: >> >> Daniil, is levels definition compatible with libpq compression patch? >> +typedef struct Compress { >> + CompressionAlgorithm alg; >> + int level; >> + /* Is a nondefault level set ? This is useful since different >> compression >> + * methods have different "default" levels. For now we assume the >> levels >> + * are all integer, though. >> + */ >> + bool level_set; >> +} Compress; > > Similarly to this patch, it is also possible to define the compression level > at the initialization stage in libpq compression patch. > > The difference is that in libpq compression patch the default compression > level always equal to 1, independently of the chosen compression algorithm. > >> On Jan 4, 2021, at 11:04 AM, Andrey Borodin wrote: >> >> Libpq compression encountered some problems with memory consumption which >> required some extra config efforts. > > >> On Jan 4, 2021, at 12:06 PM, Justin Pryzby wrote: >> >> RAM use is not significantly different from zlib, except that zstd --long >> adds >> more memory. > > Regarding ZSTD memory usage: > > Recently I’ve made a couple of tests of libpq compression with different > ZLIB/ZSTD compression levels which shown that compressing/decompressing ZSTD > w/ high compression levels > require to allocate more virtual (Commited_AS) memory, which may be exploited > by malicious clients: > > https://www.postgresql.org/message-id/62527092-16BD-479F-B503-FA527AF3B0C2%40yandex-team.ru > > We can avoid high memory usage by limiting the max window size to 8MB. This > should effectively disable the support of compression levels above 19: > https://www.postgresql.org/message-id/6A45DFAA-1682-4EF2-B835-C5F46615EC49%40yandex-team.ru > > So maybe it is worthwhile to use similar restrictions in this patch. > I think there's a big difference between those two patches.
In the libpq case, the danger is that the client requests the server to compress the data in a way that requires a lot of memory. I.e. the memory is consumed on the server. With this pg_dump patch, the compression is done by the pg_dump process, not the server. So if the attacker configures the compression in a way that requires a lot of memory, so what? He'll just allocate memory on the client machine, where he could also just run a custom binary that does a huge malloc(). So I don't think we need to worry about this too much. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: zstd compression for pg_dump
On 1/4/21 3:53 AM, Justin Pryzby wrote: About 89% smaller. Did a quick code review of the patch. I have not taken it for a spin yet and there are parts of the code I have not read yet. ## Is there any reason for this diff? -cfp *fp = pg_malloc(sizeof(cfp)); +cfp *fp = pg_malloc0(sizeof(cfp)); ## Since we now have multiple returns in cfopen() I am not sure that setting fp to NULL is still clearer than just returning NULL. ## I do not like that this pretends to support r+, w+ and a+ but does not actually do so since it does not create an input stream in those cases. else if (mode[0] == 'w' || mode[0] == 'a' || strchr(mode, '+') != NULL) [...] else if (strchr(mode, 'r')) ## Wouldn't cfread(), cfwrite(), cfgetc(), cfgets(), cfclose() and cfeof() be cleaner with switch statements similar to cfopen()? ## "/* Should be called "method" or "library" ? */" Maybe, but personally I think algorithm is fine too. ## "Is a nondefault level set ?" The PostgreSQL project does not use space before question mark (at least not in English). ## Why isn't level_set just a local variable in parse_compression()? It does not seem to be used elsewhere. ## Shouldn't we call the Compression variable in OpenArchive() nocompress to match with the naming convention in other places. And in general I wonder if we should not write "nocompression = {COMPR_ALG_NONE}" rather than "nocompression = {0}". ## Why not use const on the pointers to Compression for functions like cfopen()? As far as I can see several of them could be const. ## Shouldn't "AH->compression.alg = Z_DEFAULT_COMPRESSION" in ReadHead() be "AH->compression.alg = COMPR_ALG_DEFAULT"? Additionally I am not convinced that returning COMPR_ALG_DEFAULT will even work but I have not had the time to test that theory yet. And in general I am quite sceptical that we really need COMPR_ALG_DEFAULT. ## Some white space issues Add spaces around plus in "atoi(1+eq)" and "pg_log_error("unknown compression algorithm: %s", 1+eq)".
Add spaces around plus in parse_compression(), e.g. in "strlen(1+eq)". ## Shouldn't hasSuffix() take the current compression algorithm as a parameter? Or alternatively look up which compression algorithm to use from the suffix? ## Why support multiple ways to write zlib on the command line? I do not see any advantage of being able to write it as libz. ## I feel renaming SaveOutput() to GetOutput() would make it more clear what it does now that you have changed the return type. ## You have accidentally committed "-runstatedir" in configure. I have no idea why we do not have it (maybe it is something Debian specific) but even if we are going to add it it should not be in this patch. Same with the parenthesis changes to LARGE_OFF_T. ## This is probably out of scope of your patch but I am not a fan of the fallback logic in cfopen_read(). I feel ideally we should always know if there is a suffix or not and not try to guess file names and do pointless syscalls. ## COMPR_ALG_DEFAULT looks like it would error out for archive and directory if someone has neither zlib nor zstandard. It feels like it should default to uncompressed if we have neither. Or at least give a better error message. Note, there's currently several "compression" patches in CF app. This patch seems to be independent of the others, but probably shouldn't be totally uncoordinated (like adding lz4 in one and zstd in another might be poor execution). A thought here is that maybe we want to use the same values for the enums in all patches. Especially if we write the numeric value to pg_dump files. Andreas
Re: zstd compression for pg_dump
Hi!

> On Jan 4, 2021, at 11:04 AM, Andrey Borodin wrote:
>
> Daniil, is levels definition compatible with libpq compression patch?
> +typedef struct Compress {
> + CompressionAlgorithmalg;
> + int level;
> + /* Is a nondefault level set ? This is useful since different compression
> + * methods have different "default" levels. For now we assume the levels
> + * are all integer, though.
> + */
> + boollevel_set;
> +} Compress;

Similarly to this patch, the libpq compression patch also makes it possible to define the compression level at the initialization stage. The difference is that in the libpq compression patch the default compression level is always equal to 1, independently of the chosen compression algorithm.

> On Jan 4, 2021, at 11:04 AM, Andrey Borodin wrote:
>
> Libpq compression encountered some problems with memory consumption which
> required some extra config efforts.

> On Jan 4, 2021, at 12:06 PM, Justin Pryzby wrote:
>
> RAM use is not significantly different from zlib, except that zstd --long adds
> more memory.

Regarding ZSTD memory usage:

Recently I’ve made a couple of tests of libpq compression with different ZLIB/ZSTD compression levels, which showed that compressing/decompressing with ZSTD at high compression levels requires allocating more virtual (Committed_AS) memory, which may be exploited by malicious clients:
https://www.postgresql.org/message-id/62527092-16BD-479F-B503-FA527AF3B0C2%40yandex-team.ru

We can avoid high memory usage by limiting the max window size to 8MB. This should effectively disable the support of compression levels above 19:
https://www.postgresql.org/message-id/6A45DFAA-1682-4EF2-B835-C5F46615EC49%40yandex-team.ru

So maybe it is worthwhile to use similar restrictions in this patch.

—
Daniil Zakhlystov
Re: zstd compression for pg_dump
On Mon, Jan 04, 2021 at 11:04:57AM +0500, Andrey Borodin wrote:
> > On Jan 4, 2021, at 07:53, Justin Pryzby wrote:
> >
> > Note, there's currently several "compression" patches in CF app. This patch
> > seems to be independent of the others, but probably shouldn't be totally
> > uncoordinated (like adding lz4 in one and zstd in another might be poor
> > execution).
> >
> > https://commitfest.postgresql.org/31/2897/
> > - Faster pglz compression
> > https://commitfest.postgresql.org/31/2813/
> > - custom compression methods for toast
> > https://commitfest.postgresql.org/31/2773/
> > - libpq compression
>
> I think that's a downside of our development system: patch authors do not want
> to create dependencies on other patches.

I think in these cases, someone who notices common/overlapping patches should suggest that the authors review each other's work. In some cases, I think it's appropriate to come up with a "shared" preliminary patch (or patches), which both (all) patch authors can include as 0001 until it's finalized and merged. That might be true for some things like the tableam work, or the two "online checksum" patches.

> I'd say that both lz4 and zstd should be supported in TOAST, FPIs, libpq, and
> pg_dump. As to pglz - I think we should not proliferate it any further.

pg_basebackup came up as another use on another thread, I think related to libpq protocol compression.

> Libpq compression encountered some problems with memory consumption which
> required some extra config efforts. Did you measure memory usage for this
> patchset?

RAM use is not significantly different from zlib, except that zstd --long adds more memory.

$ command time -v pg_dump -d ts -t ... -Fc -Z0 |wc -c
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:28.77
Maximum resident set size (kbytes): 40504
1397288924 # no compression: 1400MB

$ command time -v pg_dump -d ts -t ... -Fc |wc -c
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.17
Maximum resident set size (kbytes): 40504
132932415 # default (zlib) compression: 132 MB

$ command time -v ./pg_dump -d ts -t ... -Fc |wc -c
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:29.28
Maximum resident set size (kbytes): 40568
86048139 # zstd: 86MB

$ command time -v ./pg_dump -d ts -t ... -Fc -Z 'alg=zstd opt=zstdlong' |wc -c
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:30.49
Maximum resident set size (kbytes): 180332
72202937 # zstd long: 180MB

> [PATCH 04/20] struct compressLibs
> I think this directive would be correct.
> +// #ifdef HAVE_LIBZ?

I'm not sure .. I'm thinking of making the COMPR_ALG_* always defined, and then fail later if an operation is unsupported. There's an excessive number of #ifdefs already, so the early commits are intended to minimize as far as possible what's needed for each additional compression algorithm (lib/method/whatever it's called). I haven't tested much with pg_restore of files with unsupported compression libs.

> [PATCH 06/20] pg_dump: zstd compression
> I'd propose to build with Zstd by default. It seems other patches do it this
> way. Though, there are possible downsides.

Yes...but the cfbot turns red if the patch requires zstd, so it defaults to off until it's included in the build environments (but for now, the main patch isn't being tested).

Thanks for looking.

--
Justin
Re: zstd compression for pg_dump
> On Jan 4, 2021, at 07:53, Justin Pryzby wrote:
>
> Note, there's currently several "compression" patches in CF app. This patch
> seems to be independent of the others, but probably shouldn't be totally
> uncoordinated (like adding lz4 in one and zstd in another might be poor
> execution).
>
> https://commitfest.postgresql.org/31/2897/
> - Faster pglz compression
> https://commitfest.postgresql.org/31/2813/
> - custom compression methods for toast
> https://commitfest.postgresql.org/31/2773/
> - libpq compression

I think that's a downside of our development system: patch authors do not want to create dependencies on other patches.

I'd say that both lz4 and zstd should be supported in TOAST, FPIs, libpq, and pg_dump. As to pglz - I think we should not proliferate it any further.

Lz4 and Zstd actually represent different tradeoffs. Basically, lz4 is so CPU-cheap that one should use it whenever they write to disk or a network interface. Zstd represents an actual bandwidth/CPU tradeoff.

Also, none of the patchsets touch an important possibility - a preexisting dictionary could radically improve compression of small data (even in pglz).

Some minor notes on the patchset in this thread.

Libpq compression encountered some problems with memory consumption which required some extra config efforts. Did you measure memory usage for this patchset?

[PATCH 03/20] Support multiple compression algs/levels/opts..

abtracts -> abstracts

enum CompressionAlgorithm actually represents the very same thing as in "Custom compression methods".

Daniil, is the levels definition compatible with the libpq compression patch?

+typedef struct Compress {
+ CompressionAlgorithmalg;
+ int level;
+ /* Is a nondefault level set ? This is useful since different compression
+* methods have different "default" levels. For now we assume the levels
+* are all integer, though.
+ */
+ boollevel_set;
+} Compress;

[PATCH 04/20] struct compressLibs

I think this directive would be correct.
+// #ifdef HAVE_LIBZ?

Here's an extra comment
// && errno == ENOENT)

[PATCH 06/20] pg_dump: zstd compression

I'd propose to build with Zstd by default. It seems other patches do it this way. Though, there are possible downsides.

Thanks for working on this! We will have very IO-efficient Postgres :)

Best regards, Andrey Borodin.
Re: zstd compression for pg_dump
On Mon, Dec 21, 2020 at 01:49:24PM -0600, Justin Pryzby wrote:
> a big disadvantage of piping through zstd is that it's not identified as a
> PGDMP file, and, /usr/bin/file on centos7 fails to even identify zstd by its
> magic number..

Other reasons are that pg_dump |zstd >output.zst loses the exit status of pg_dump, and that it's not "transparent" (one needs to type "zstd -dq |pg_restore").

On Mon, Dec 21, 2020 at 08:32:35PM -0600, Justin Pryzby wrote:
> On Mon, Dec 21, 2020 at 03:02:40PM -0500, Tom Lane wrote:
> > Justin Pryzby writes:
> > > I found that our largest tables are 40% smaller and 20% faster to pipe
> > > pg_dump -Fc -Z0 |zstd relative to native zlib
> >
> > The patch might be a tad smaller if you hadn't included a core file in it.
>
> About 89% smaller.
>
> This also fixes the extension (.zst)
> And fixes zlib default compression.
> And a bunch of cleanup.

I rebased so the "typedef struct compression" patch is first and zstd is on top of that (say, in case someone wants to bikeshed about which compression algorithm to support). And made a central struct with all the compression-specific info to further isolate the compression-specific changes. And handle compression of the "plain" archive format. And fix compilation for MSVC and make --without-zstd the default. And fix cfgets() (which I think is actually unused code for the code paths for compressed FP). And add a fix for a pre-existing problem: ftello() on unseekable input.

I also started a patch to allow compression of the "tar" format, but I didn't include that here yet.

Note, there's currently several "compression" patches in CF app. This patch seems to be independent of the others, but probably shouldn't be totally uncoordinated (like adding lz4 in one and zstd in another might be poor execution).
https://commitfest.postgresql.org/31/2897/ - Faster pglz compression https://commitfest.postgresql.org/31/2813/ - custom compression methods for toast https://commitfest.postgresql.org/31/2773/ - libpq compression -- Justin >From d2fc2673e19a95629edfe9a0f4ead75e1f1f2754 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Wed, 23 Dec 2020 23:56:54 -0600 Subject: [PATCH 01/20] fix!preeexisting --- src/bin/pg_dump/compress_io.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c index 1417401086..6a428978d4 100644 --- a/src/bin/pg_dump/compress_io.c +++ b/src/bin/pg_dump/compress_io.c @@ -28,7 +28,7 @@ * The interface for reading an archive consists of just one function: * ReadDataFromArchive. ReadDataFromArchive reads the whole compressed input * stream, by repeatedly calling the given ReadFunc. ReadFunc returns the - * compressed data chunk at a time, and ReadDataFromArchive decompresses it + * compressed data one chunk at a time, and ReadDataFromArchive decompresses it * and passes the decompressed data to ahwrite(), until ReadFunc returns 0 * to signal EOF. * -- 2.17.0 >From e8bb61dd633aefb2cc7a14887a15cc60e05cd9c5 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Tue, 22 Dec 2020 15:40:08 -0600 Subject: [PATCH 02/20] Fix broken error message on unseekable input.. 
pg_dump -Fd -f dir.dir cat dir.dir/toc.dat | pg_restore -l # I realize this isn't how it's intended to be used pg_restore: error: corrupt tar header found in PGDMP (expected 0, computed 18577) file position 18446744073709551615 See also 929c69aa19 --- src/bin/pg_dump/pg_backup_tar.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c index 54e708875c..61c9c87a9f 100644 --- a/src/bin/pg_dump/pg_backup_tar.c +++ b/src/bin/pg_dump/pg_backup_tar.c @@ -1280,12 +1280,14 @@ _tarGetHeader(ArchiveHandle *AH, TAR_MEMBER *th) if (chk != sum) { - char posbuf[32]; + off_t off = ftello(ctx->tarFH); - snprintf(posbuf, sizeof(posbuf), UINT64_FORMAT, - (uint64) ftello(ctx->tarFH)); - fatal("corrupt tar header found in %s (expected %d, computed %d) file position %s", - tag, sum, chk, posbuf); + if (off == -1) + fatal("corrupt tar header found in %s (expected %d, computed %d)", + tag, sum, chk); + else + fatal("corrupt tar header found in %s (expected %d, computed %d) file position " UINT64_FORMAT, + tag, sum, chk, off); } th->targetFile = pg_strdup(tag); -- 2.17.0 >From bbf91d13a25bcb2f0718f40984362f81741ca66a Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Tue, 22 Dec 2020 00:23:43 -0600 Subject: [PATCH 03/20] Support multiple compression algs/levels/opts.. The existing implementation abtracts compressed and noncompressed I/O. This preliminary commit is intended to also allow for multiple compression algorithms. --- src/bin/pg_dump/compress_io.c | 220 +++--- src/bin/pg_dump/compress_io.h | 19 +-- src/bin/pg_dump/pg_backup.h | 23 ++- src/bin/pg_dump/pg_backup_archiver.c | 45 +++---
Re: zstd compression for pg_dump
On Mon, Dec 21, 2020 at 03:02:40PM -0500, Tom Lane wrote: > Justin Pryzby writes: > > I found that our largest tables are 40% smaller and 20% faster to pipe > > pg_dump -Fc -Z0 |zstd relative to native zlib > > The patch might be a tad smaller if you hadn't included a core file in it. About 89% smaller. This also fixes the extension (.zst) And fixes zlib default compression. And a bunch of cleanup. -- Justin >From c506c8a93ae70726a5b4b33a1d0098caf5665f3a Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Mon, 21 Dec 2020 00:32:32 -0600 Subject: [PATCH 1/7] fix pre-existing docs/comments --- doc/src/sgml/ref/pg_dump.sgml | 2 +- src/bin/pg_dump/pg_backup_directory.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml index 0aa35cf0c3..dcb25dc3cd 100644 --- a/doc/src/sgml/ref/pg_dump.sgml +++ b/doc/src/sgml/ref/pg_dump.sgml @@ -621,7 +621,7 @@ PostgreSQL documentation Specify the compression level to use. Zero means no compression. -For the custom archive format, this specifies compression of +For the custom and directory archive formats, this specifies compression of individual table-data segments, and the default is to compress at a moderate level. For plain text output, setting a nonzero compression level causes diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c index 48fa7cb1a3..650b542fce 100644 --- a/src/bin/pg_dump/pg_backup_directory.c +++ b/src/bin/pg_dump/pg_backup_directory.c @@ -4,7 +4,7 @@ * * A directory format dump is a directory, which contains a "toc.dat" file * for the TOC, and a separate file for each data entry, named ".dat". - * Large objects (BLOBs) are stored in separate files named "blob_.dat", + * Large objects (BLOBs) are stored in separate files named "blob_.dat", * and there's a plain-text TOC file for them called "blobs.toc". 
If * compression is used, each data file is individually compressed and the * ".gz" suffix is added to the filenames. The TOC files are never -- 2.17.0 >From 49f400b5b2e559cc2ed6be82e04a4dc4ba5c63b3 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Sat, 12 Dec 2020 00:11:37 -0600 Subject: [PATCH 2/7] Fix malformed comment.. broken since bf50caf10. --- src/bin/pg_dump/pg_backup_archiver.h | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h index 177360ed6e..6a5a22637b 100644 --- a/src/bin/pg_dump/pg_backup_archiver.h +++ b/src/bin/pg_dump/pg_backup_archiver.h @@ -329,9 +329,12 @@ struct _archiveHandle DumpId *tableDataId; /* TABLE DATA ids, indexed by table dumpId */ struct _tocEntry *currToc; /* Used when dumping data */ - int compression; /* Compression requested on open Possible - * values for compression: -1 - * Z_DEFAULT_COMPRESSION 0 COMPRESSION_NONE + int compression; /*- + * Compression requested on open + * Possible values for compression: + * -2 ZSTD_COMPRESSION + * -1 Z_DEFAULT_COMPRESSION + * 0 COMPRESSION_NONE * 1-9 levels for gzip compression */ bool dosync; /* data requested to be synced on sight */ ArchiveMode mode; /* File mode - r or w */ -- 2.17.0 >From 8a8939a3060c984af289b5fe62754d94f2675248 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Fri, 11 Dec 2020 15:01:25 -0600 Subject: [PATCH 3/7] pg_dump: zstd compression document any change in search for .gz? 
docs --- configure | 123 ++- configure.ac | 22 ++ src/bin/pg_dump/compress_io.c | 475 -- src/bin/pg_dump/compress_io.h | 5 +- src/bin/pg_dump/pg_backup_archiver.h | 5 + src/bin/pg_dump/pg_backup_directory.c | 6 +- src/bin/pg_dump/pg_dump.c | 3 +- src/include/pg_config.h.in| 3 + 8 files changed, 597 insertions(+), 45 deletions(-) diff --git a/configure b/configure index 11a4284e5b..240e536e04 100755 --- a/configure +++ b/configure @@ -698,6 +698,7 @@ with_gnu_ld LD LDFLAGS_SL LDFLAGS_EX +with_zstd with_zlib with_system_tzdata with_libxslt @@ -798,6 +799,7 @@ infodir docdir oldincludedir includedir +runstatedir localstatedir sharedstatedir sysconfdir @@ -866,6 +868,7 @@ with_libxml with_libxslt with_system_tzdata with_zlib +with_zstd with_gnu_ld enable_largefile ' @@ -935,6 +938,7 @@ datadir='${datarootdir}' sysconfdir='${prefix}/etc' sharedstatedir='${prefix}/com' localstatedir='${prefix}/var' +runstatedir='${localstatedir}/run' includedir='${prefix}/include' oldincludedir='/usr/include' docdir='${datarootdir}/doc/${PACKAGE_TARNAME}' @@ -1187,6 +1191,15 @@ do | -silent | --silent | --silen | --sile | --sil) silent=yes ;; + -runstatedir | --runstatedir | --runstatedi | --runstated \ + |
Re: zstd compression for pg_dump
Justin Pryzby writes:
> I found that our largest tables are 40% smaller and 20% faster to pipe
> pg_dump -Fc -Z0 |zstd relative to native zlib

The patch might be a tad smaller if you hadn't included a core file in it.

regards, tom lane