Re: zstd compression for pg_dump

2023-04-06 Thread Tom Lane
Justin Pryzby  writes:
> On Thu, Apr 06, 2023 at 05:34:30PM +0200, Tomas Vondra wrote:
>> I think that's all for PG16 in this patch series.  If there's more we want to
>> do, it'll have to wait for PG17 - 

> Yes

Shouldn't the CF entry be closed as committed?  It's certainly
making the cfbot unhappy because the patch-of-record doesn't apply.

regards, tom lane




Re: zstd compression for pg_dump

2023-04-06 Thread Justin Pryzby
On Thu, Apr 06, 2023 at 05:34:30PM +0200, Tomas Vondra wrote:
> I looked at the long mode patch again, updated the commit message and
> pushed it. I was wondering if long_mode should really be bool -
> logically it is, but ZSTD_CCtx_setParameter() expects int. But I think
> that's fine.

Thanks!

> I think that's all for PG16 in this patch series.  If there's more we want to
> do, it'll have to wait for PG17 - 

Yes

> Justin, can you update and submit the patches that you think are relevant for
> the next CF?

Yeah.

It sounds like a shiny new feature, but it's not totally clear if it's safe
here or even how useful it is.  (It might be like my patch for
wal_compression=zstd:level, and Michael's for toast_compression=zstd, neither
of which saw any support).

Last year's basebackup thread had some interesting comments about safety of
threads, although pg_dump's considerations may be different.

The patch itself is trivial, so it'd be fine to wait until PG16 is released to
get some experience.  If someone else wanted to do that, it'd be fine with me.

-- 
Justin




Re: zstd compression for pg_dump

2023-04-06 Thread Tomas Vondra



On 4/5/23 21:42, Tomas Vondra wrote:
> On 4/4/23 05:04, Justin Pryzby wrote:
>> On Mon, Apr 03, 2023 at 11:26:09PM +0200, Tomas Vondra wrote:
>>> On 4/3/23 21:17, Justin Pryzby wrote:
 On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
>> Feel free to mess around with threads (but I'd much rather see the patch
>> progress for zstd:long).
>
> OK, understood. The long mode patch is pretty simple. IIUC it does not
> change the format, i.e. in the worst case we could leave it for PG17
> too. Correct?

 Right, libzstd only has one "format", which is the same as what's used
 by the commandline tool.  zstd:long doesn't change the format of the
 output: the library just uses a larger memory buffer to allow better
 compression.  There's no format change for zstd:workers, either.
>>>
>>> OK. I plan to do a bit more review/testing on this, and get it committed
>>> over the next day or two, likely including the long mode. One thing I
>>> noticed today is that maybe long_distance should be a bool, not int.
>>> Yes, ZSTD_c_enableLongDistanceMatching() accepts int, but it'd be
>>> cleaner to cast the value during a call and keep it bool otherwise.
>>
>> Thanks for noticing.  Evidently I wrote it using "int" to get the
>> feature working, and then later wrote the bool parsing bits but never
>> changed the data structure.
>>
>> This also updates a few comments, indentation, removes a useless
>> assertion, and updates the warning about zstd:workers.
>>
> 
> Thanks. I've cleaned up the 0001 a little bit (a couple comment
> improvements), updated the commit message and pushed it. I plan to take
> care of the 0002 (long distance mode) tomorrow, and that'll be it for
> PG16 I think.

I looked at the long mode patch again, updated the commit message and
pushed it. I was wondering if long_mode should really be bool -
logically it is, but ZSTD_CCtx_setParameter() expects int. But I think
that's fine.
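
A minimal sketch of that pattern, assuming only libzstd's public API
(set_long_distance is a placeholder name, not code from the patch): the
option stays a bool in our code and is cast to int only at the call site.

#include <stdbool.h>
#include <stdio.h>
#include <zstd.h>

static void
set_long_distance(ZSTD_CCtx *cctx, bool long_distance)
{
	size_t		ret;

	/* ZSTD_CCtx_setParameter() takes an int, so cast here and only here */
	ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching,
								 (int) long_distance);
	if (ZSTD_isError(ret))
		fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(ret));
}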

I think that's all for PG16 in this patch series. If there's more we
want to do, it'll have to wait for PG17 - Justin, can you update and
submit the patches that you think are relevant for the next CF?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: zstd compression for pg_dump

2023-04-05 Thread Tomas Vondra
On 4/4/23 05:04, Justin Pryzby wrote:
> On Mon, Apr 03, 2023 at 11:26:09PM +0200, Tomas Vondra wrote:
>> On 4/3/23 21:17, Justin Pryzby wrote:
>>> On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
> Feel free to mess around with threads (but I'd much rather see the patch
> progress for zstd:long).

 OK, understood. The long mode patch is pretty simple. IIUC it does not
 change the format, i.e. in the worst case we could leave it for PG17
 too. Correct?
>>>
>>> Right, libzstd only has one "format", which is the same as what's used
>>> by the commandline tool.  zstd:long doesn't change the format of the
>>> output: the library just uses a larger memory buffer to allow better
>>> compression.  There's no format change for zstd:workers, either.
>>
>> OK. I plan to do a bit more review/testing on this, and get it committed
>> over the next day or two, likely including the long mode. One thing I
>> noticed today is that maybe long_distance should be a bool, not int.
>> Yes, ZSTD_c_enableLongDistanceMatching() accepts int, but it'd be
>> cleaner to cast the value during a call and keep it bool otherwise.
> 
> Thanks for noticing.  Evidently I wrote it using "int" to get the
> feature working, and then later wrote the bool parsing bits but never
> changed the data structure.
> 
> This also updates a few comments, indentation, removes a useless
> assertion, and updates the warning about zstd:workers.
> 

Thanks. I've cleaned up the 0001 a little bit (a couple comment
improvements), updated the commit message and pushed it. I plan to take
care of the 0002 (long distance mode) tomorrow, and that'll be it for
PG16 I think.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: zstd compression for pg_dump

2023-04-03 Thread Justin Pryzby
On Mon, Apr 03, 2023 at 11:26:09PM +0200, Tomas Vondra wrote:
> On 4/3/23 21:17, Justin Pryzby wrote:
> > On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
> >>> Feel free to mess around with threads (but I'd much rather see the patch
> >>> progress for zstd:long).
> >>
> >> OK, understood. The long mode patch is pretty simple. IIUC it does not
> >> change the format, i.e. in the worst case we could leave it for PG17
> >> too. Correct?
> > 
> > Right, libzstd only has one "format", which is the same as what's used
> > by the commandline tool.  zstd:long doesn't change the format of the
> > output: the library just uses a larger memory buffer to allow better
> > compression.  There's no format change for zstd:workers, either.
> 
> OK. I plan to do a bit more review/testing on this, and get it committed
> over the next day or two, likely including the long mode. One thing I
> noticed today is that maybe long_distance should be a bool, not int.
> Yes, ZSTD_c_enableLongDistanceMatching() accepts int, but it'd be
> cleaner to cast the value during a call and keep it bool otherwise.

Thanks for noticing.  Evidently I wrote it using "int" to get the
feature working, and then later wrote the bool parsing bits but never
changed the data structure.

This also updates a few comments, indentation, removes a useless
assertion, and updates the warning about zstd:workers.

-- 
Justin
From df0eb4d3c4799f24e58f1e5b0a9470e5af355ad6 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 7 Jan 2023 15:45:06 -0600
Subject: [PATCH 1/4] pg_dump: zstd compression

Previously proposed at: 20201221194924.gi30...@telsasoft.com
---
 doc/src/sgml/ref/pg_dump.sgml |  13 +-
 src/bin/pg_dump/Makefile  |   2 +
 src/bin/pg_dump/compress_io.c |  66 ++--
 src/bin/pg_dump/compress_zstd.c   | 537 ++
 src/bin/pg_dump/compress_zstd.h   |  25 ++
 src/bin/pg_dump/meson.build   |   4 +-
 src/bin/pg_dump/pg_backup_archiver.c  |   9 +-
 src/bin/pg_dump/pg_backup_directory.c |   2 +
 src/bin/pg_dump/pg_dump.c |  20 +-
 src/bin/pg_dump/t/002_pg_dump.pl  |  79 +++-
 src/tools/pginclude/cpluspluscheck|   1 +
 src/tools/pgindent/typedefs.list  |   1 +
 12 files changed, 705 insertions(+), 54 deletions(-)
 create mode 100644 src/bin/pg_dump/compress_zstd.c
 create mode 100644 src/bin/pg_dump/compress_zstd.h

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 77299878e02..8de38e0fd0d 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -330,8 +330,9 @@ PostgreSQL documentation
machine-readable format that pg_restore
can read. A directory format archive can be manipulated with
standard Unix tools; for example, files in an uncompressed archive
-   can be compressed with the gzip or
-   lz4 tools.
+   can be compressed with the gzip,
+   lz4, or
+   zstd tools.
This format is compressed by default using gzip
and also supports parallel dumps.
   
@@ -655,7 +656,8 @@ PostgreSQL documentation

 Specify the compression method and/or the compression level to use.
 The compression method can be set to gzip,
-lz4, or none for no compression.
+lz4, zstd,
+or none for no compression.
 A compression detail string can optionally be specified.  If the
 detail string is an integer, it specifies the compression level.
 Otherwise, it should be a comma-separated list of items, each of the
@@ -676,8 +678,9 @@ PostgreSQL documentation
 individual table-data segments, and the default is to compress using
 gzip at a moderate level. For plain text output,
 setting a nonzero compression level causes the entire output file to be compressed,
-as though it had been fed through gzip or
-lz4; but the default is not to compress.
+as though it had been fed through gzip,
+lz4, or zstd;
+but the default is not to compress.


 The tar archive format currently does not support compression at all.
diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile
index eb8f59459a1..24de7593a6a 100644
--- a/src/bin/pg_dump/Makefile
+++ b/src/bin/pg_dump/Makefile
@@ -18,6 +18,7 @@ include $(top_builddir)/src/Makefile.global
 
 export GZIP_PROGRAM=$(GZIP)
 export LZ4
+export ZSTD
 export with_icu
 
 override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
@@ -29,6 +30,7 @@ OBJS = \
 	compress_io.o \
 	compress_lz4.o \
 	compress_none.o \
+	compress_zstd.o \
 	dumputils.o \
 	parallel.o \
 	pg_backup_archiver.o \
diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c
index 0972a4f934a..4f06bb024f9 100644
--- a/src/bin/pg_dump/compress_io.c
+++ b/src/bin/pg_dump/compress_io.c
@@ -52,8 +52,8 @@
  *
  *	InitDiscoverCompressFileHandle tries to infer the 

Re: zstd compression for pg_dump

2023-04-03 Thread Tomas Vondra



On 4/3/23 21:17, Justin Pryzby wrote:
> On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
>>> Feel free to mess around with threads (but I'd much rather see the patch
>>> progress for zstd:long).
>>
>> OK, understood. The long mode patch is pretty simple. IIUC it does not
>> change the format, i.e. in the worst case we could leave it for PG17
>> too. Correct?
> 
> Right, libzstd only has one "format", which is the same as what's used
> by the commandline tool.  zstd:long doesn't change the format of the
> output: the library just uses a larger memory buffer to allow better
> compression.  There's no format change for zstd:workers, either.
> 

OK. I plan to do a bit more review/testing on this, and get it committed
over the next day or two, likely including the long mode. One thing I
noticed today is that maybe long_distance should be a bool, not int.
Yes, ZSTD_c_enableLongDistanceMatching() accepts int, but it'd be
cleaner to cast the value during a call and keep it bool otherwise.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: zstd compression for pg_dump

2023-04-03 Thread Justin Pryzby
On Sat, Apr 01, 2023 at 10:26:01PM +0200, Tomas Vondra wrote:
> > Feel free to mess around with threads (but I'd much rather see the patch
> > progress for zstd:long).
> 
> OK, understood. The long mode patch is pretty simple. IIUC it does not
> change the format, i.e. in the worst case we could leave it for PG17
> too. Correct?

Right, libzstd only has one "format", which is the same as what's used
by the commandline tool.  zstd:long doesn't change the format of the
output: the library just uses a larger memory buffer to allow better
compression.  There's no format change for zstd:workers, either.
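
To make that concrete, a sketch using libzstd's public API (these calls are
not lifted from the patch, and the window-log value 27, a 128MB window, is
illustrative): long mode only tunes how much history the encoder may
reference, and a decoder's window limit may need raising to match, but the
frame format itself is unchanged.

#include <zstd.h>

static void
enable_long_mode(ZSTD_CCtx *cctx, ZSTD_DCtx *dctx)
{
	/* let the encoder find matches across a larger window */
	ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, 1);
	ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 27);
	/* allow the decoder to accept frames declaring that window size */
	ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 27);
}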

-- 
Justin




Re: zstd compression for pg_dump

2023-04-01 Thread Tomas Vondra
On 4/1/23 15:36, Justin Pryzby wrote:
>
> ...
>
>> If there are no concerns, why disable it outside Windows? I don't have a
>> good idea how beneficial the multi-threaded compression is, so I can't
>> quite judge the risk/benefits tradeoff.
> 
> Because it's a minor/fringe feature, and it's annoying to have platform
> differences (would we plan on relaxing the restriction in v17, or is it
> more likely we'd forget ?).
> 
> I realized how little I've tested with zstd workers myself.  And I think
> on cirrusci, the macos and freebsd tasks have zstd libraries with
> threading support, but it wasn't being exercised (because using :workers
> would cause the patch to fail unless it's supported everywhere).  So I
> updated the "for CI only" patch to 1) use meson wraps to compile zstd
> library with threading on linux and windows; and, 2) use zstd:workers=3
> "opportunistically" (but avoid failing if threads are not supported,
> since the autoconf task still doesn't have access to a library with
> thread support).  That's a great step, but it still seems bad that the
> thread stuff has been little exercised until now.  (Also, the windows
> task failed; I think that's due to a transient network issue).
> 

Agreed, let's leave the threading for PG17, depending on how beneficial
it turns out to be for pg_dump.

> Feel free to mess around with threads (but I'd much rather see the patch
> progress for zstd:long).

OK, understood. The long mode patch is pretty simple. IIUC it does not
change the format, i.e. in the worst case we could leave it for PG17
too. Correct?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: zstd compression for pg_dump

2023-04-01 Thread Justin Pryzby
On Sat, Apr 01, 2023 at 02:49:44PM +0200, Tomas Vondra wrote:
> On 4/1/23 02:28, Justin Pryzby wrote:
> > On Sat, Apr 01, 2023 at 02:11:12AM +0200, Tomas Vondra wrote:
> >> On 4/1/23 01:16, Justin Pryzby wrote:
> >>> On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
>  On 3/27/23 19:28, Justin Pryzby wrote:
> > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> >> On 3/16/23 05:50, Justin Pryzby wrote:
> >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>  On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion 
>   wrote:
> > I did some smoke testing against zstd's GitHub release on Windows. 
> > To
> > build against it, I had to construct an import library, and put that
> > and the DLL into the `lib` folder expected by the MSVC scripts...
> > which makes me wonder if I've chosen a harder way than necessary?
> 
>  It looks like pg_dump's meson.build is missing dependencies on zstd
>  (meson couldn't find the headers in the subproject without them).
> >>>
> >>> I saw that this was added for LZ4, but I hadn't added it for zstd 
> >>> since
> >>> I didn't run into an issue without it.  Could you check that what I've
> >>> added works for your case ?
> >>>
> > Parallel zstd dumps seem to work as expected, in that the resulting
> > pg_restore output is identical to uncompressed dumps and nothing
> > explodes. I haven't inspected the threading implementation for 
> > safety
> > yet, as you mentioned.
> 
>  Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>  handling safety for this, by isolating each thread's state. I don't 
>  feel
>  comfortable pronouncing this new addition safe or not, because I'm 
>  not
>  sure I understand what the comments in the format-specific _Clone()
>  callbacks are saying yet.
> >>>
> >>> My line of reasoning for unix is that pg_dump forks before any calls 
> >>> to
> >>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> >>> doesn't apply to pg_dump under windows.  This is an open question.  If
> >>> If
> >>> there's no solid answer, I could disable/ignore the option (maybe only
> >>> under windows).
> >>
> >> I may be missing something, but why would the patch affect this? Why
> >> would it even affect safety of the parallel dump? And I don't see any
> >> changes to the clone stuff ...
> >
> > zstd supports using threads during compression, with -Z zstd:workers=N.
> > When unix forks, the child processes can't do anything to mess up the
> > state of the parent processes.  
> >
> > But windows pg_dump uses threads instead of forking, so it seems
> > possible that the pg_dump -j threads that then spawn zstd threads could
> > "leak threads" and break the main thread.  I suspect there's no issue,
> > but we still ought to verify that before declaring it safe.
> 
>  OK. I don't have access to a Windows machine so I can't test that. Is it
>  possible to disable the zstd threading, until we figure this out?
> >>>
> >>> I think that's what's best.  I made it issue a warning if "workers" was
> >>> specified.  It could also be an error, or just ignored.
> >>>
> >>> I considered disabling workers only for windows, but realized that I
> >>> haven't tested with threads myself - my local zstd package is compiled
> >>> without threading, and I remember having some issue recompiling it with
> >>> threading.  Jacob's recipe for using meson wraps works well, but it
> >>> still seems better to leave it as a future feature.  I used that recipe
> >>> to enable zstd with threading on CI (except for linux/autoconf).
> >>
> >> +1 to disable this if we're unsure it works correctly. I agree it's
> >> better to just error out if workers are requested - I rather dislike
> >> when a tool just ignores an explicit parameter. And AFAICS it's what
> >> zstd does too, when someone requests workers on an incompatible build.
> >>
> >> FWIW I've been thinking about this a bit more and I don't quite see why
> >> the threading would cause issues (except for Windows). I forgot
> >> pg_basebackup already supports zstd, including the worker threading, so
> >> why would it work there and not in pg_dump? Sure, pg_basebackup is not
> >> parallel, but with separate pg_dump processes that shouldn't be an issue
> >> (although I'm not sure when zstd creates threads).
> > 
> > There's no concern at all except under windows (because on windows
> > pg_dump -j is implemented using threads rather than forking).
> > Especially since zstd:workers is already allowed in the basebackup
> > backend process.
> 
> If there are no concerns, why disable it outside Windows? I don't have a
> good idea how beneficial the multi-threaded compression is, so I 

Re: zstd compression for pg_dump

2023-04-01 Thread Tomas Vondra



On 4/1/23 02:28, Justin Pryzby wrote:
> On Sat, Apr 01, 2023 at 02:11:12AM +0200, Tomas Vondra wrote:
>> On 4/1/23 01:16, Justin Pryzby wrote:
>>> On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
 On 3/27/23 19:28, Justin Pryzby wrote:
> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>> On 3/16/23 05:50, Justin Pryzby wrote:
>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
 On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion 
  wrote:
> I did some smoke testing against zstd's GitHub release on Windows. To
> build against it, I had to construct an import library, and put that
> and the DLL into the `lib` folder expected by the MSVC scripts...
> which makes me wonder if I've chosen a harder way than necessary?

 It looks like pg_dump's meson.build is missing dependencies on zstd
 (meson couldn't find the headers in the subproject without them).
>>>
>>> I saw that this was added for LZ4, but I hadn't added it for zstd since
>>> I didn't run into an issue without it.  Could you check that what I've
>>> added works for your case ?
>>>
> Parallel zstd dumps seem to work as expected, in that the resulting
> pg_restore output is identical to uncompressed dumps and nothing
> explodes. I haven't inspected the threading implementation for safety
> yet, as you mentioned.

 Hm. Best I can tell, the CloneArchive() machinery is supposed to be
 handling safety for this, by isolating each thread's state. I don't 
 feel
 comfortable pronouncing this new addition safe or not, because I'm not
 sure I understand what the comments in the format-specific _Clone()
 callbacks are saying yet.
>>>
>>> My line of reasoning for unix is that pg_dump forks before any calls to
>>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
>>> doesn't apply to pg_dump under windows.  This is an open question.  If
>>> there's no solid answer, I could disable/ignore the option (maybe only
>>> under windows).
>>
>> I may be missing something, but why would the patch affect this? Why
>> would it even affect safety of the parallel dump? And I don't see any
>> changes to the clone stuff ...
>
> zstd supports using threads during compression, with -Z zstd:workers=N.
> When unix forks, the child processes can't do anything to mess up the
> state of the parent processes.  
>
> But windows pg_dump uses threads instead of forking, so it seems
> possible that the pg_dump -j threads that then spawn zstd threads could
> "leak threads" and break the main thread.  I suspect there's no issue,
> but we still ought to verify that before declaring it safe.

 OK. I don't have access to a Windows machine so I can't test that. Is it
 possible to disable the zstd threading, until we figure this out?
>>>
>>> I think that's what's best.  I made it issue a warning if "workers" was
>>> specified.  It could also be an error, or just ignored.
>>>
>>> I considered disabling workers only for windows, but realized that I
>>> haven't tested with threads myself - my local zstd package is compiled
>>> without threading, and I remember having some issue recompiling it with
>>> threading.  Jacob's recipe for using meson wraps works well, but it
>>> still seems better to leave it as a future feature.  I used that recipe
>>> to enable zstd with threading on CI (except for linux/autoconf).
>>
>> +1 to disable this if we're unsure it works correctly. I agree it's
>> better to just error out if workers are requested - I rather dislike
>> when a tool just ignores an explicit parameter. And AFAICS it's what
>> zstd does too, when someone requests workers on an incompatible build.
>>
>> FWIW I've been thinking about this a bit more and I don't quite see why
>> the threading would cause issues (except for Windows). I forgot
>> pg_basebackup already supports zstd, including the worker threading, so
>> why would it work there and not in pg_dump? Sure, pg_basebackup is not
>> parallel, but with separate pg_dump processes that shouldn't be an issue
>> (although I'm not sure when zstd creates threads).
> 
> There's no concern at all except under windows (because on windows
> pg_dump -j is implemented using threads rather than forking).
> Especially since zstd:workers is already allowed in the basebackup
> backend process.
> 

If there are no concerns, why disable it outside Windows? I don't have a
good idea how beneficial the multi-threaded compression is, so I can't
quite judge the risk/benefits tradeoff.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: zstd compression for pg_dump

2023-03-31 Thread Justin Pryzby
On Sat, Apr 01, 2023 at 02:11:12AM +0200, Tomas Vondra wrote:
> On 4/1/23 01:16, Justin Pryzby wrote:
> > On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
> >> On 3/27/23 19:28, Justin Pryzby wrote:
> >>> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>  On 3/16/23 05:50, Justin Pryzby wrote:
> > On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion 
> >>  wrote:
> >>> I did some smoke testing against zstd's GitHub release on Windows. To
> >>> build against it, I had to construct an import library, and put that
> >>> and the DLL into the `lib` folder expected by the MSVC scripts...
> >>> which makes me wonder if I've chosen a harder way than necessary?
> >>
> >> It looks like pg_dump's meson.build is missing dependencies on zstd
> >> (meson couldn't find the headers in the subproject without them).
> >
> > I saw that this was added for LZ4, but I hadn't added it for zstd since
> > I didn't run into an issue without it.  Could you check that what I've
> > added works for your case ?
> >
> >>> Parallel zstd dumps seem to work as expected, in that the resulting
> >>> pg_restore output is identical to uncompressed dumps and nothing
> >>> explodes. I haven't inspected the threading implementation for safety
> >>> yet, as you mentioned.
> >>
> >> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> >> handling safety for this, by isolating each thread's state. I don't 
> >> feel
> >> comfortable pronouncing this new addition safe or not, because I'm not
> >> sure I understand what the comments in the format-specific _Clone()
> >> callbacks are saying yet.
> >
> > My line of reasoning for unix is that pg_dump forks before any calls to
> > zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> > doesn't apply to pg_dump under windows.  This is an open question.  If
> > there's no solid answer, I could disable/ignore the option (maybe only
> > under windows).
> 
>  I may be missing something, but why would the patch affect this? Why
>  would it even affect safety of the parallel dump? And I don't see any
>  changes to the clone stuff ...
> >>>
> >>> zstd supports using threads during compression, with -Z zstd:workers=N.
> >>> When unix forks, the child processes can't do anything to mess up the
> >>> state of the parent processes.  
> >>>
> >>> But windows pg_dump uses threads instead of forking, so it seems
> >>> possible that the pg_dump -j threads that then spawn zstd threads could
> >>> "leak threads" and break the main thread.  I suspect there's no issue,
> >>> but we still ought to verify that before declaring it safe.
> >>
> >> OK. I don't have access to a Windows machine so I can't test that. Is it
> >> possible to disable the zstd threading, until we figure this out?
> > 
> > I think that's what's best.  I made it issue a warning if "workers" was
> > specified.  It could also be an error, or just ignored.
> > 
> > I considered disabling workers only for windows, but realized that I
> > haven't tested with threads myself - my local zstd package is compiled
> > without threading, and I remember having some issue recompiling it with
> > threading.  Jacob's recipe for using meson wraps works well, but it
> > still seems better to leave it as a future feature.  I used that recipe
> > to enable zstd with threading on CI (except for linux/autoconf).
> 
> +1 to disable this if we're unsure it works correctly. I agree it's
> better to just error out if workers are requested - I rather dislike
> when a tool just ignores an explicit parameter. And AFAICS it's what
> zstd does too, when someone requests workers on an incompatible build.
> 
> FWIW I've been thinking about this a bit more and I don't quite see why
> the threading would cause issues (except for Windows). I forgot
> pg_basebackup already supports zstd, including the worker threading, so
> why would it work there and not in pg_dump? Sure, pg_basebackup is not
> parallel, but with separate pg_dump processes that shouldn't be an issue
> (although I'm not sure when zstd creates threads).

There's no concern at all except under windows (because on windows
pg_dump -j is implemented using threads rather than forking).
Especially since zstd:workers is already allowed in the basebackup
backend process.

> I'll try building zstd with threading enabled, and do some tests over
> the weekend.

Feel free to wait until v17 :)

I used "meson wraps" to get a local version with threading.  Note that
if you want to use a zstd subproject, you may have to specify
-Dzstd=enabled, or else meson may not enable the library at all.

Also, in order to introspect its settings, I had to do it like this:

mkdir subprojects
meson wrap install zstd
meson subprojects download
mkdir build.meson
meson setup -C build.meson 

Re: zstd compression for pg_dump

2023-03-31 Thread Tomas Vondra
On 4/1/23 01:16, Justin Pryzby wrote:
> On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
>> On 3/27/23 19:28, Justin Pryzby wrote:
>>> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
 On 3/16/23 05:50, Justin Pryzby wrote:
> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion  
>> wrote:
>>> I did some smoke testing against zstd's GitHub release on Windows. To
>>> build against it, I had to construct an import library, and put that
>>> and the DLL into the `lib` folder expected by the MSVC scripts...
>>> which makes me wonder if I've chosen a harder way than necessary?
>>
>> It looks like pg_dump's meson.build is missing dependencies on zstd
>> (meson couldn't find the headers in the subproject without them).
>
> I saw that this was added for LZ4, but I hadn't added it for zstd since
> I didn't run into an issue without it.  Could you check that what I've
> added works for your case ?
>
>>> Parallel zstd dumps seem to work as expected, in that the resulting
>>> pg_restore output is identical to uncompressed dumps and nothing
>>> explodes. I haven't inspected the threading implementation for safety
>>> yet, as you mentioned.
>>
>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>> handling safety for this, by isolating each thread's state. I don't feel
>> comfortable pronouncing this new addition safe or not, because I'm not
>> sure I understand what the comments in the format-specific _Clone()
>> callbacks are saying yet.
>
> My line of reasoning for unix is that pg_dump forks before any calls to
> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> doesn't apply to pg_dump under windows.  This is an open question.  If
> there's no solid answer, I could disable/ignore the option (maybe only
> under windows).

 I may be missing something, but why would the patch affect this? Why
 would it even affect safety of the parallel dump? And I don't see any
 changes to the clone stuff ...
>>>
>>> zstd supports using threads during compression, with -Z zstd:workers=N.
>>> When unix forks, the child processes can't do anything to mess up the
>>> state of the parent processes.  
>>>
>>> But windows pg_dump uses threads instead of forking, so it seems
>>> possible that the pg_dump -j threads that then spawn zstd threads could
>>> "leak threads" and break the main thread.  I suspect there's no issue,
>>> but we still ought to verify that before declaring it safe.
>>
>> OK. I don't have access to a Windows machine so I can't test that. Is it
>> possible to disable the zstd threading, until we figure this out?
> 
> I think that's what's best.  I made it issue a warning if "workers" was
> specified.  It could also be an error, or just ignored.
> 
> I considered disabling workers only for windows, but realized that I
> haven't tested with threads myself - my local zstd package is compiled
> without threading, and I remember having some issue recompiling it with
> threading.  Jacob's recipe for using meson wraps works well, but it
> still seems better to leave it as a future feature.  I used that recipe
> to enable zstd with threading on CI (except for linux/autoconf).
> 

+1 to disable this if we're unsure it works correctly. I agree it's
better to just error out if workers are requested - I rather dislike
when a tool just ignores an explicit parameter. And AFAICS it's what
zstd does too, when someone requests workers on an incompatible build.

FWIW I've been thinking about this a bit more and I don't quite see why
the threading would cause issues (except for Windows). I forgot
pg_basebackup already supports zstd, including the worker threading, so
why would it work there and not in pg_dump? Sure, pg_basebackup is not
parallel, but with separate pg_dump processes that shouldn't be an issue
(although I'm not sure when zstd creates threads).

The one thing I'm wondering about is at which point are the worker
threads spawned - but presumably not before the pg_dump processes fork.

I'll try building zstd with threading enabled, and do some tests over
the weekend.

> The function is first checking if it was passed a filename which already
> has a suffix.  And if not, it searches through a list of suffixes,
> testing for an existing file with each suffix.  The search with stat()
> doesn't happen if it has a suffix.  I'm having trouble seeing how the
> hasSuffix() branch isn't dead code.  Another open question.

 AFAICS it's done this way because of this comment in pg_backup_directory

  * ...
  * ".gz" suffix is added to the filenames. The TOC files are never
  * compressed by pg_dump, however they are accepted with the .gz suffix
  * too, in case the user has manually compressed them with 'gzip'.

 I 

Re: zstd compression for pg_dump

2023-03-31 Thread Justin Pryzby
On Tue, Mar 28, 2023 at 06:23:26PM +0200, Tomas Vondra wrote:
> On 3/27/23 19:28, Justin Pryzby wrote:
> > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> >> On 3/16/23 05:50, Justin Pryzby wrote:
> >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>  On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion  
>  wrote:
> > I did some smoke testing against zstd's GitHub release on Windows. To
> > build against it, I had to construct an import library, and put that
> > and the DLL into the `lib` folder expected by the MSVC scripts...
> > which makes me wonder if I've chosen a harder way than necessary?
> 
>  It looks like pg_dump's meson.build is missing dependencies on zstd
>  (meson couldn't find the headers in the subproject without them).
> >>>
> >>> I saw that this was added for LZ4, but I hadn't added it for zstd since
> >>> I didn't run into an issue without it.  Could you check that what I've
> >>> added works for your case ?
> >>>
> > Parallel zstd dumps seem to work as expected, in that the resulting
> > pg_restore output is identical to uncompressed dumps and nothing
> > explodes. I haven't inspected the threading implementation for safety
> > yet, as you mentioned.
> 
>  Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>  handling safety for this, by isolating each thread's state. I don't feel
>  comfortable pronouncing this new addition safe or not, because I'm not
>  sure I understand what the comments in the format-specific _Clone()
>  callbacks are saying yet.
> >>>
> >>> My line of reasoning for unix is that pg_dump forks before any calls to
> >>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> >>> doesn't apply to pg_dump under windows.  This is an open question.  If
> >>> there's no solid answer, I could disable/ignore the option (maybe only
> >>> under windows).
> >>
> >> I may be missing something, but why would the patch affect this? Why
> >> would it even affect safety of the parallel dump? And I don't see any
> >> changes to the clone stuff ...
> > 
> > zstd supports using threads during compression, with -Z zstd:workers=N.
> > When unix forks, the child processes can't do anything to mess up the
> > state of the parent processes.  
> > 
> > But windows pg_dump uses threads instead of forking, so it seems
> > possible that the pg_dump -j threads that then spawn zstd threads could
> > "leak threads" and break the main thread.  I suspect there's no issue,
> > but we still ought to verify that before declaring it safe.
> 
> OK. I don't have access to a Windows machine so I can't test that. Is it
> possible to disable the zstd threading, until we figure this out?

I think that's what's best.  I made it issue a warning if "workers" was
specified.  It could also be an error, or just ignored.

I considered disabling workers only for windows, but realized that I
haven't tested with threads myself - my local zstd package is compiled
without threading, and I remember having some issue recompiling it with
threading.  Jacob's recipe for using meson wraps works well, but it
still seems better to leave it as a future feature.  I used that recipe
to enable zstd with threading on CI (except for linux/autoconf).

> >>> The function is first checking if it was passed a filename which already
> >>> has a suffix.  And if not, it searches through a list of suffixes,
> >>> testing for an existing file with each suffix.  The search with stat()
> >>> doesn't happen if it has a suffix.  I'm having trouble seeing how the
> >>> hasSuffix() branch isn't dead code.  Another open question.
> >>
> >> AFAICS it's done this way because of this comment in pg_backup_directory
> >>
> >>  * ...
> >>  * ".gz" suffix is added to the filenames. The TOC files are never
> >>  * compressed by pg_dump, however they are accepted with the .gz suffix
> >>  * too, in case the user has manually compressed them with 'gzip'.
> >>
> >> I haven't tried, but I believe that if you manually compress the
> >> directory, it may hit this branch.
> > 
> > That would make sense, but when I tried, it didn't work like that.
> > The filenames were all uncompressed names.  Maybe it worked differently
> > in an older release.  Or maybe it changed during development of the
> > parallel-directory-dump patch and it's actually dead code.
> 
> Interesting. Would be good to find out. I wonder if a little bit of
> git-log digging could tell us more. Anyway, until we confirm it's dead
> code, we should probably do what .gz does and have the same check for
> .lz4 and .zst files.

I found that hasSuffix() and cfopen() originated in the refactored patch
Heikki's sent here; there's no history beyond that.

https://www.postgresql.org/message-id/4D3954C7.9060503%40enterprisedb.com

The patch published there appends the .gz within cfopen(), and the
caller writes into the TOC the filename without ".gz".  It seems like

Re: zstd compression for pg_dump

2023-03-29 Thread Jacob Champion
On Wed, Mar 29, 2023 at 6:35 AM Justin Pryzby  wrote:
> If you have a zstd library with thread support, you could test with
> -Z zstd:workers=3.  But I think threads aren't enabled in the common
> libzstd packages.  Jacob figured out how to compile libzstd easily using
> "meson wraps" - but I don't know the details.

From the source root,

$ mkdir subprojects
$ meson wrap install zstd

From then on, Meson was pretty automagical about it during the ninja
build. The subproject's settings are themselves inspectable and
settable via `meson configure`:

$ meson configure -Dzstd:<option>=<value>

--Jacob




Re: zstd compression for pg_dump

2023-03-29 Thread Justin Pryzby
On Tue, Mar 28, 2023 at 02:03:49PM -0400, Kirk Wolak wrote:
> On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra  
> wrote:
> > On 3/27/23 19:28, Justin Pryzby wrote:
> > > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> > >> On 3/16/23 05:50, Justin Pryzby wrote:
> > >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >  On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion 
> >   wrote:
> > > I did some smoke testing against zstd's GitHub release on Windows. To
> > ...
> > OK. I don't have access to a Windows machine so I can't test that. Is it
> > possible to disable the zstd threading, until we figure this out?
>
> Tomas, since I appear to be one of the few windows users (I use both), can I
> help?
> I can test pg_dump... for you, easy to do.  I do about 5-10 pg_dumps a day
> on windows while developing.

It'd be great if you'd exercise this and other changes to
pg_dump/restore.  Tomas just pushed a bugfix, so be sure to "git pull"
before testing, or else you might rediscover the bug.

If you have a zstd library with thread support, you could test with
-Z zstd:workers=3.  But I think threads aren't enabled in the common
libzstd packages.  Jacob figured out how to compile libzstd easily using
"meson wraps" - but I don't know the details.

-- 
Justin




Re: zstd compression for pg_dump

2023-03-28 Thread Jacob Champion
On Wed, Mar 15, 2023 at 9:50 PM Justin Pryzby  wrote:
> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> > It looks like pg_dump's meson.build is missing dependencies on zstd
> > (meson couldn't find the headers in the subproject without them).
>
> I saw that this was added for LZ4, but I hadn't added it for zstd since
> I didn't run into an issue without it.  Could you check that what I've
> added works for your case ?

I thought I replied to this, sorry -- your newest patchset builds
correctly with subprojects, so the new dependency looks good to me.
Thanks!

> > Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> > handling safety for this, by isolating each thread's state. I don't feel
> > comfortable pronouncing this new addition safe or not, because I'm not
> > sure I understand what the comments in the format-specific _Clone()
> > callbacks are saying yet.
>
> My line of reasoning for unix is that pg_dump forks before any calls to
> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> doesn't apply to pg_dump under windows.  This is an open question.  If
> there's no solid answer, I could disable/ignore the option (maybe only
> under windows).

To (maybe?) move this forward a bit, note that pg_backup_custom's
_Clone() function makes sure that there is no active compressor state
at the beginning of the new thread. pg_backup_directory's
implementation has no such provision. And I don't think it can,
because the parent thread might have concurrently set one up -- see
the directory-specific implementation of _CloseArchive(). Perhaps we
should just NULL out those fields after the copy, instead?

To illustrate why I think this is tough to characterize: if I've read
the code correctly, the _Clone() and CloneArchive() implementations
are running concurrently with code that is actively modifying the
ArchiveHandle and the lclContext. So safety is only ensured to the
extent that we keep track of which fields threads are allowed to
touch, and I don't have that mental model.
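
A sketch of that "NULL out after the copy" idea, assuming pg_dump's
internal types (ArchiveHandle, lclContext) and a hypothetical dataFH
field; the real pg_backup_directory.c layout may differ:

static void
_Clone(ArchiveHandle *AH)
{
	lclContext *ctx = (lclContext *) pg_malloc(sizeof(lclContext));

	/* give the new thread its own copy of the format-specific state */
	memcpy(ctx, AH->formatData, sizeof(lclContext));
	AH->formatData = ctx;

	/*
	 * Don't inherit a compressor handle the parent thread may still be
	 * using concurrently; force this thread to open its own.
	 */
	ctx->dataFH = NULL;
}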

--Jacob




Re: zstd compression for pg_dump

2023-03-28 Thread Tomas Vondra
On 3/28/23 20:03, Kirk Wolak wrote:
> On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra
> <tomas.von...@enterprisedb.com>
> wrote:
> 
> On 3/27/23 19:28, Justin Pryzby wrote:
> > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> >> On 3/16/23 05:50, Justin Pryzby wrote:
> >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>  On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion
> <jchamp...@timescale.com> wrote:
> > I did some smoke testing against zstd's GitHub release on
> Windows. To
> ...
> OK. I don't have access to a Windows machine so I can't test that. Is it
> possible to disable the zstd threading, until we figure this out?
> 
> Tomas, since I appear to be one of the few windows users (I use both),
> can I help?
> I can test pg_dump... for you, easy to do.  I do about 5-10 pg_dumps a
> day on windows while developing.
> 

Perhaps. But I'll leave the details up to Justin - it's his patch, and
I'm not sure how to verify the threading is OK.

I'd try applying this patch, build with --with-zstd and then run the
pg_dump TAP tests, and perhaps do some manual tests.

And perhaps do the same for --with-lz4 - there's a thread [1] suggesting
we don't detect lz4 stuff on Windows, so the TAP tests do nothing.

[1] https://www.postgresql.org/message-id/zajl96n9zw84u...@msg.df7cb.de

> Also, I have an AWS instance I created to build PG/Win with readline
> back in November.
> I could give you access to that...  (you are not the only person who has
> made this statement here).
> I've made such instances available for other Open Source developers, to
> support them.
> 
>  Obvi I would share connection credentials privately.
> 

I'd rather leave the Windows stuff up to someone with more experience
with that platform. I have plenty of other stuff on my plate atm.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: zstd compression for pg_dump

2023-03-28 Thread Kirk Wolak
On Tue, Mar 28, 2023 at 12:23 PM Tomas Vondra 
wrote:

> On 3/27/23 19:28, Justin Pryzby wrote:
> > On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> >> On 3/16/23 05:50, Justin Pryzby wrote:
> >>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>  On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion <
> jchamp...@timescale.com> wrote:
> > I did some smoke testing against zstd's GitHub release on Windows. To
> ...
> OK. I don't have access to a Windows machine so I can't test that. Is it
> possible to disable the zstd threading, until we figure this out?
>
Tomas, since I appear to be one of the few windows users (I use both), can
I help?
I can test pg_dump... for you, easy to do.  I do about 5-10 pg_dumps a day
on windows while developing.

Also, I have an AWS instance I created to build PG/Win with readline back
in November.
I could give you access to that...  (you are not the only person who has
made this statement here).
I've made such instances available for other Open Source developers, to
support them.

 Obvi I would share connection credentials privately.

Regards, Kirk


Re: zstd compression for pg_dump

2023-03-28 Thread Tomas Vondra



On 3/27/23 19:28, Justin Pryzby wrote:
> On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
>> On 3/16/23 05:50, Justin Pryzby wrote:
>>> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
 On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion  
 wrote:
> I did some smoke testing against zstd's GitHub release on Windows. To
> build against it, I had to construct an import library, and put that
> and the DLL into the `lib` folder expected by the MSVC scripts...
> which makes me wonder if I've chosen a harder way than necessary?

 It looks like pg_dump's meson.build is missing dependencies on zstd
 (meson couldn't find the headers in the subproject without them).
>>>
>>> I saw that this was added for LZ4, but I hadn't added it for zstd since
>>> I didn't run into an issue without it.  Could you check that what I've
>>> added works for your case ?
>>>
> Parallel zstd dumps seem to work as expected, in that the resulting
> pg_restore output is identical to uncompressed dumps and nothing
> explodes. I haven't inspected the threading implementation for safety
> yet, as you mentioned.

 Hm. Best I can tell, the CloneArchive() machinery is supposed to be
 handling safety for this, by isolating each thread's state. I don't feel
 comfortable pronouncing this new addition safe or not, because I'm not
 sure I understand what the comments in the format-specific _Clone()
 callbacks are saying yet.
>>>
>>> My line of reasoning for unix is that pg_dump forks before any calls to
>>> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
>>> doesn't apply to pg_dump under windows.  This is an open question.  If
>>> there's no solid answer, I could disable/ignore the option (maybe only
>>> under windows).
>>
>> I may be missing something, but why would the patch affect this? Why
>> would it even affect safety of the parallel dump? And I don't see any
>> changes to the clone stuff ...
> 
> zstd supports using threads during compression, with -Z zstd:workers=N.
> When unix forks, the child processes can't do anything to mess up the
> state of the parent processes.  
> 
> But windows pg_dump uses threads instead of forking, so it seems
> possible that the pg_dump -j threads that then spawn zstd threads could
> "leak threads" and break the main thread.  I suspect there's no issue,
> but we still ought to verify that before declaring it safe.
> 

OK. I don't have access to a Windows machine so I can't test that. Is it
possible to disable the zstd threading, until we figure this out?

>>> The function is first checking if it was passed a filename which already
>>> has a suffix.  And if not, it searches through a list of suffixes,
>>> testing for an existing file with each suffix.  The search with stat()
>>> doesn't happen if it has a suffix.  I'm having trouble seeing how the
>>> hasSuffix() branch isn't dead code.  Another open question.
>>
>> AFAICS it's done this way because of this comment in pg_backup_directory
>>
>>  * ...
>>  * ".gz" suffix is added to the filenames. The TOC files are never
>>  * compressed by pg_dump, however they are accepted with the .gz suffix
>>  * too, in case the user has manually compressed them with 'gzip'.
>>
>> I haven't tried, but I believe that if you manually compress the
>> directory, it may hit this branch.
> 
> That would make sense, but when I tried, it didn't work like that.
> The filenames were all uncompressed names.  Maybe it worked differently
> in an older release.  Or maybe it changed during development of the
> parallel-directory-dump patch and it's actually dead code.
> 

Interesting. Would be good to find out. I wonder if a little bit of
git-log digging could tell us more. Anyway, until we confirm it's dead
code, we should probably do what .gz does and have the same check for
.lz4 and .zst files.

> This is rebased over the updated compression API.
> 
> It seems like I misunderstood something you said before, so now I put
> back "supports_compression()".
> 

Thanks! I need to do a bit more testing and review, but it seems pretty
much RFC to me, assuming we can figure out what to do about threading.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: zstd compression for pg_dump

2023-03-27 Thread Justin Pryzby
On Fri, Mar 17, 2023 at 03:43:31AM +0100, Tomas Vondra wrote:
> On 3/16/23 05:50, Justin Pryzby wrote:
> > On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> >> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion  
> >> wrote:
> >>> I did some smoke testing against zstd's GitHub release on Windows. To
> >>> build against it, I had to construct an import library, and put that
> >>> and the DLL into the `lib` folder expected by the MSVC scripts...
> >>> which makes me wonder if I've chosen a harder way than necessary?
> >>
> >> It looks like pg_dump's meson.build is missing dependencies on zstd
> >> (meson couldn't find the headers in the subproject without them).
> > 
> > I saw that this was added for LZ4, but I hadn't added it for zstd since
> > I didn't run into an issue without it.  Could you check that what I've
> > added works for your case ?
> > 
> >>> Parallel zstd dumps seem to work as expected, in that the resulting
> >>> pg_restore output is identical to uncompressed dumps and nothing
> >>> explodes. I haven't inspected the threading implementation for safety
> >>> yet, as you mentioned.
> >>
> >> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> >> handling safety for this, by isolating each thread's state. I don't feel
> >> comfortable pronouncing this new addition safe or not, because I'm not
> >> sure I understand what the comments in the format-specific _Clone()
> >> callbacks are saying yet.
> > 
> > My line of reasoning for unix is that pg_dump forks before any calls to
> > zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> > doesn't apply to pg_dump under windows.  This is an open question.  If
> > there's no solid answer, I could disable/ignore the option (maybe only
> > under windows).
> 
> I may be missing something, but why would the patch affect this? Why
> would it even affect safety of the parallel dump? And I don't see any
> changes to the clone stuff ...

zstd supports using threads during compression, with -Z zstd:workers=N.
When unix forks, the child processes can't do anything to mess up the
state of the parent processes.  

But windows pg_dump uses threads instead of forking, so it seems
possible that the pg_dump -j threads that then spawn zstd threads could
"leak threads" and break the main thread.  I suspect there's no issue,
but we still ought to verify that before declaring it safe.
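
For reference, the worker request is a per-context libzstd parameter, and
a library built without ZSTD_MULTITHREAD rejects it at runtime - a sketch
(request_workers is a hypothetical helper, not code from the patch):

#include <stdio.h>
#include <zstd.h>

static int
request_workers(ZSTD_CCtx *cctx, int nworkers)
{
	/* nbWorkers >= 1 fails on a libzstd built without ZSTD_MULTITHREAD */
	size_t		ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, nworkers);

	if (ZSTD_isError(ret))
	{
		fprintf(stderr, "zstd workers unavailable: %s\n",
				ZSTD_getErrorName(ret));
		return -1;
	}
	return 0;
}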

> > The function is first checking if it was passed a filename which already
> > has a suffix.  And if not, it searches through a list of suffixes,
> > testing for an existing file with each suffix.  The search with stat()
> > doesn't happen if it has a suffix.  I'm having trouble seeing how the
> > hasSuffix() branch isn't dead code.  Another open question.
> 
> AFAICS it's done this way because of this comment in pg_backup_directory
> 
>  * ...
>  * ".gz" suffix is added to the filenames. The TOC files are never
>  * compressed by pg_dump, however they are accepted with the .gz suffix
>  * too, in case the user has manually compressed them with 'gzip'.
> 
> I haven't tried, but I believe that if you manually compress the
> directory, it may hit this branch.

That would make sense, but when I tried, it didn't work like that.
The filenames were all uncompressed names.  Maybe it worked differently
in an older release.  Or maybe it changed during development of the
parallel-directory-dump patch and it's actually dead code.

This is rebased over the updated compression API.

It seems like I misunderstood something you said before, so now I put
back "supports_compression()".

-- 
Justin
From b4cf9df2f8672b012301d58ff83f2b0344de9e96 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 7 Jan 2023 15:45:06 -0600
Subject: [PATCH 1/3] pg_dump: zstd compression

Previously proposed at: 20201221194924.gi30...@telsasoft.com
---
 doc/src/sgml/ref/pg_dump.sgml |  13 +-
 src/bin/pg_dump/Makefile  |   2 +
 src/bin/pg_dump/compress_io.c |  62 ++-
 src/bin/pg_dump/compress_zstd.c   | 535 ++
 src/bin/pg_dump/compress_zstd.h   |  25 ++
 src/bin/pg_dump/meson.build   |   4 +-
 src/bin/pg_dump/pg_backup_archiver.c  |   9 +-
 src/bin/pg_dump/pg_backup_directory.c |   2 +
 src/bin/pg_dump/pg_dump.c |  16 +-
 src/bin/pg_dump/t/002_pg_dump.pl  |  79 +++-
 src/tools/pginclude/cpluspluscheck|   1 +
 11 files changed, 694 insertions(+), 54 deletions(-)
 create mode 100644 src/bin/pg_dump/compress_zstd.c
 create mode 100644 src/bin/pg_dump/compress_zstd.h

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index d6b1faa8042..62b3ed2dad6 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -330,8 +330,9 @@ PostgreSQL documentation
machine-readable format that pg_restore
can read. A directory format archive can be manipulated with
standard Unix tools; for example, files in an uncompressed 

Re: zstd compression for pg_dump

2023-03-27 Thread Tomas Vondra
Hi,

On 3/17/23 03:43, Tomas Vondra wrote:
> 
> ...
>
>>> I'm a little suspicious of the replacement of supports_compression()
>>> with parse_compress_specification(). For example:
>>>
 -   errmsg = supports_compression(AH->compression_spec);
 -   if (errmsg)
 +   parse_compress_specification(AH->compression_spec.algorithm,
 +   NULL, &compress_spec);
 +   if (compress_spec.parse_error != NULL)
 {
 pg_log_warning("archive is compressed, but this installation does
 not support compression (%s) -- no data will be available",
 -   errmsg);
 -   pg_free(errmsg);
 +   compress_spec.parse_error);
 +   pg_free(compress_spec.parse_error);
 }
>>>
>>> The top-level error here is "does not support compression", but wouldn't
>>> a bad specification option with a supported compression method trip this
>>> path too?
>>
>> No - since the 2nd argument is passed as NULL, it just checks whether
>> the compression is supported.  Maybe there ought to be a more
>> direct/clean way to do it.  But up to now evidently nobody needed to do
>> that.
>>
> 
> I don't think the patch can use parse_compress_specification() to
> replace supports_compression(). The parsing simply determines if the
> build has the library, it doesn't say if a particular tool was modified
> to support the algorithm. I might build --with-zstd and yet pg_dump does
> not support that algorithm yet.
> 
> Even after we add zstd to pg_dump, it's quite likely other compression
> algorithms may not be supported by pg_dump from day 1.
> 
> 
> I haven't looked at / tested the patch yet, but I wonder if you have any
> thoughts regarding the size_t / int tweaks. I don't know what types zstd
> library uses, how it reports errors etc.
> 

Any thoughts regarding my comments on removing supports_compression()?
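
To restate the distinction with a sketch modeled loosely on
supports_compression() in compress_io.c (the real function may differ;
pg_compress_specification and friends come from common/compression.h):
parsing only proves the build has the library, while this per-tool list
answers whether pg_dump itself has been taught the algorithm.

static char *
supports_compression_sketch(const pg_compress_specification compression_spec)
{
	bool		supported = false;

	if (compression_spec.algorithm == PG_COMPRESSION_NONE)
		supported = true;
#ifdef HAVE_LIBZ
	if (compression_spec.algorithm == PG_COMPRESSION_GZIP)
		supported = true;
#endif
#ifdef USE_LZ4
	if (compression_spec.algorithm == PG_COMPRESSION_LZ4)
		supported = true;
#endif
	/* an algorithm appears here only once this tool actually handles it */

	if (!supported)
		return psprintf("this build does not support compression with %s",
						get_compress_algorithm_name(compression_spec.algorithm));
	return NULL;
}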

Also, this patch needs a rebase to adapt it to the API changes from last
week. The sooner the better, considering we're getting fairly close to
the end of the CF and code freeze.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: zstd compression for pg_dump

2023-03-16 Thread Tomas Vondra



On 3/16/23 05:50, Justin Pryzby wrote:
> On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
>> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion  
>> wrote:
>>> I did some smoke testing against zstd's GitHub release on Windows. To
>>> build against it, I had to construct an import library, and put that
>>> and the DLL into the `lib` folder expected by the MSVC scripts...
>>> which makes me wonder if I've chosen a harder way than necessary?
>>
>> It looks like pg_dump's meson.build is missing dependencies on zstd
>> (meson couldn't find the headers in the subproject without them).
> 
> I saw that this was added for LZ4, but I hadn't added it for zstd since
> I didn't run into an issue without it.  Could you check that what I've
> added works for your case ?
> 
>>> Parallel zstd dumps seem to work as expected, in that the resulting
>>> pg_restore output is identical to uncompressed dumps and nothing
>>> explodes. I haven't inspected the threading implementation for safety
>>> yet, as you mentioned.
>>
>> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
>> handling safety for this, by isolating each thread's state. I don't feel
>> comfortable pronouncing this new addition safe or not, because I'm not
>> sure I understand what the comments in the format-specific _Clone()
>> callbacks are saying yet.
> 
> My line of reasoning for unix is that pg_dump forks before any calls to
> zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
> doesn't apply to pg_dump under windows.  This is an open question.  If
> there's no solid answer, I could disable/ignore the option (maybe only
> under windows).
> 

I may be missing something, but why would the patch affect this? Why
would it even affect safety of the parallel dump? And I don't see any
changes to the clone stuff ...

>> On to code (not a complete review):
>>
>>> if (hasSuffix(fname, ".gz"))
>>> compression_spec.algorithm = PG_COMPRESSION_GZIP;
>>> else
>>> {
>>> bool        exists;
>>>
>>> exists = (stat(path, &st) == 0);
>>> /* avoid unused warning if it is not built with compression */
>>> if (exists)
>>> compression_spec.algorithm = PG_COMPRESSION_NONE;
>>> -#ifdef HAVE_LIBZ
>>> -   if (!exists)
>>> -   {
>>> -   free_keep_errno(fname);
>>> -   fname = psprintf("%s.gz", path);
>>> -   exists = (stat(fname, &st) == 0);
>>> -
>>> -   if (exists)
>>> -   compression_spec.algorithm = PG_COMPRESSION_GZIP;
>>> -   }
>>> -#endif
>>> -#ifdef USE_LZ4
>>> -   if (!exists)
>>> -   {
>>> -   free_keep_errno(fname);
>>> -   fname = psprintf("%s.lz4", path);
>>> -   exists = (stat(fname, &st) == 0);
>>> -
>>> -   if (exists)
>>> -   compression_spec.algorithm = PG_COMPRESSION_LZ4;
>>> -   }
>>> -#endif
>>> +   else if (check_compressed_file(path, , "gz"))
>>> +   compression_spec.algorithm = PG_COMPRESSION_GZIP;
>>> +   else if (check_compressed_file(path, , "lz4"))
>>> +   compression_spec.algorithm = PG_COMPRESSION_LZ4;
>>> +   else if (check_compressed_file(path, , "zst"))
>>> +   compression_spec.algorithm = PG_COMPRESSION_ZSTD;
>>> }
>>
>> This function lost some coherence, I think. Should there be a hasSuffix
>> check at the top for ".zstd" (and, for that matter, ".lz4")?
> 

This was discussed in the lz4 thread a couple of days ago, and I think there
should be hasSuffix() cases for lz4/zstd too, not just for .gz.

> The function is first checking if it was passed a filename which already
> has a suffix.  And if not, it searches through a list of suffixes,
> testing for an existing file with each suffix.  The search with stat()
> doesn't happen if it has a suffix.  I'm having trouble seeing how the
> hasSuffix() branch isn't dead code.  Another open question.
> 

AFAICS it's done this way because of this comment in pg_backup_directory

 * ...
 * ".gz" suffix is added to the filenames. The TOC files are never
 * compressed by pg_dump, however they are accepted with the .gz suffix
 * too, in case the user has manually compressed them with 'gzip'.

I haven't tried, but I believe that if you manually compress the
directory, it may hit this branch. And IMO if we support that for gzip,
the other compression methods should do that too for consistency.

In any case, it's a tiny amount of code and I don't feel like ripping
that out when it might break some currently supported use case.

>> I'm a little suspicious of the replacement of supports_compression()
>> with parse_compress_specification(). For example:
>>
>>> -   errmsg = supports_compression(AH->compression_spec);
>>> -   if (errmsg)
>>> +   parse_compress_specification(AH->compression_spec.algorithm,
>>> +   NULL, &compress_spec);
>>> +   if (compress_spec.parse_error != NULL)
>>> {
>>> pg_log_warning("archive is compressed, but this installation does 
>>> not 

Re: zstd compression for pg_dump

2023-03-15 Thread Justin Pryzby
On Fri, Mar 10, 2023 at 12:48:13PM -0800, Jacob Champion wrote:
> On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion  
> wrote:
> > I did some smoke testing against zstd's GitHub release on Windows. To
> > build against it, I had to construct an import library, and put that
> > and the DLL into the `lib` folder expected by the MSVC scripts...
> > which makes me wonder if I've chosen a harder way than necessary?
> 
> It looks like pg_dump's meson.build is missing dependencies on zstd
> (meson couldn't find the headers in the subproject without them).

I saw that this was added for LZ4, but I hadn't added it for zstd since
I didn't run into an issue without it.  Could you check that what I've
added works for your case ?

> > Parallel zstd dumps seem to work as expected, in that the resulting
> > pg_restore output is identical to uncompressed dumps and nothing
> > explodes. I haven't inspected the threading implementation for safety
> > yet, as you mentioned.
> 
> Hm. Best I can tell, the CloneArchive() machinery is supposed to be
> handling safety for this, by isolating each thread's state. I don't feel
> comfortable pronouncing this new addition safe or not, because I'm not
> sure I understand what the comments in the format-specific _Clone()
> callbacks are saying yet.

My line of reasoning for unix is that pg_dump forks before any calls to
zstd.  Nothing zstd does ought to affect the pg_dump layer.  But that
doesn't apply to pg_dump under windows.  This is an open question.  If
there's no solid answer, I could disable/ignore the option (maybe only
under windows).

> On to code (not a complete review):
> 
> > if (hasSuffix(fname, ".gz"))
> > compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > else
> > {
> > bool        exists;
> > 
> > exists = (stat(path, &st) == 0);
> > /* avoid unused warning if it is not built with compression */
> > if (exists)
> > compression_spec.algorithm = PG_COMPRESSION_NONE;
> > -#ifdef HAVE_LIBZ
> > -   if (!exists)
> > -   {
> > -   free_keep_errno(fname);
> > -   fname = psprintf("%s.gz", path);
> > -   exists = (stat(fname, &st) == 0);
> > -
> > -   if (exists)
> > -   compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > -   }
> > -#endif
> > -#ifdef USE_LZ4
> > -   if (!exists)
> > -   {
> > -   free_keep_errno(fname);
> > -   fname = psprintf("%s.lz4", path);
> > -   exists = (stat(fname, &st) == 0);
> > -
> > -   if (exists)
> > -   compression_spec.algorithm = PG_COMPRESSION_LZ4;
> > -   }
> > -#endif
> > +   else if (check_compressed_file(path, , "gz"))
> > +   compression_spec.algorithm = PG_COMPRESSION_GZIP;
> > +   else if (check_compressed_file(path, , "lz4"))
> > +   compression_spec.algorithm = PG_COMPRESSION_LZ4;
> > +   else if (check_compressed_file(path, , "zst"))
> > +   compression_spec.algorithm = PG_COMPRESSION_ZSTD;
> > }
> 
> This function lost some coherence, I think. Should there be a hasSuffix
> check at the top for ".zstd" (and, for that matter, ".lz4")?

The function is first checking if it was passed a filename which already
has a suffix.  And if not, it searches through a list of suffixes,
testing for an existing file with each suffix.  The search with stat()
doesn't happen if it has a suffix.  I'm having trouble seeing how the
hasSuffix() branch isn't dead code.  Another open question.

> I'm a little suspicious of the replacement of supports_compression()
> with parse_compress_specification(). For example:
> 
> > -   errmsg = supports_compression(AH->compression_spec);
> > -   if (errmsg)
> > +   parse_compress_specification(AH->compression_spec.algorithm,
> > +   NULL, &compress_spec);
> > +   if (compress_spec.parse_error != NULL)
> > {
> > pg_log_warning("archive is compressed, but this installation does 
> > not support compression (%s
> > -   errmsg);
> > -   pg_free(errmsg);
> > +   compress_spec.parse_error);
> > +   pg_free(compress_spec.parse_error);
> > }
> 
> The top-level error here is "does not support compression", but wouldn't
> a bad specification option with a supported compression method trip this
> path too?

No - since the 2nd argument is passed as NULL, it just checks whether
the compression is supported.  Maybe there ought to be a more
direct/clean way to do it.  But up to now evidently nobody needed to do
that.

> > +static void
> > +ZSTD_CCtx_setParam_or_die(ZSTD_CStream *cstream,
> > + ZSTD_cParameter param, int value, char *paramname)
> 
> IMO we should avoid stepping on the ZSTD_ namespace with our own
> internal function names.

done

> > +   if (cs->readF != NULL)
> > +
> > +   if (cs->writeF != NULL)
> 
> This seems to suggest that both cs->readF and cs->writeF could be set,
> but in that case, the output buffer gets 

Re: zstd compression for pg_dump

2023-03-10 Thread Jacob Champion
Hi,

This'll need another rebase over the meson ICU changes.

On Wed, Mar 8, 2023 at 10:59 AM Jacob Champion 
wrote:
> I did some smoke testing against zstd's GitHub release on Windows. To
> build against it, I had to construct an import library, and put that
> and the DLL into the `lib` folder expected by the MSVC scripts...
> which makes me wonder if I've chosen a harder way than necessary?

A meson wrap made this much easier! It looks like pg_dump's meson.build
is missing dependencies on zstd (meson couldn't find the headers in the
subproject without them).

> Parallel zstd dumps seem to work as expected, in that the resulting
> pg_restore output is identical to uncompressed dumps and nothing
> explodes. I haven't inspected the threading implementation for safety
> yet, as you mentioned.

Hm. Best I can tell, the CloneArchive() machinery is supposed to be
handling safety for this, by isolating each thread's state. I don't feel
comfortable pronouncing this new addition safe or not, because I'm not
sure I understand what the comments in the format-specific _Clone()
callbacks are saying yet.

> And I still wasn't able to test :workers, since
> it looks like the official libzstd for Windows isn't built for
> multithreading. That'll be another day's project.

The wrapped installation enabled threading too, so I was able to try
with :workers=8. Everything seems to work, but I didn't have a dataset
that showed speed improvements at the time. It did seem to affect the
overall compressibility negatively -- which makes sense, I think,
assuming each thread is looking at a separate and/or smaller window.
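
For reference, the knob being tested here is zstd's own multithreading
parameter; a minimal sketch of the setup (the helper name is invented, not
the patch's code):

#include <stdio.h>
#include <zstd.h>

static ZSTD_CStream *
open_cstream_with_workers(int nworkers)
{
    ZSTD_CStream *cs = ZSTD_createCStream();
    /* nbWorkers > 0 switches zstd to multithreaded streaming compression */
    size_t        rc = ZSTD_CCtx_setParameter(cs, ZSTD_c_nbWorkers, nworkers);

    if (ZSTD_isError(rc))       /* single-threaded libzstd builds reject it */
        fprintf(stderr, "workers: %s\n", ZSTD_getErrorName(rc));
    return cs;
}

Each worker then compresses its own job window, which would be consistent
with the slightly worse ratio seen above.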

On to code (not a complete review):

> if (hasSuffix(fname, ".gz"))
> compression_spec.algorithm = PG_COMPRESSION_GZIP;
> else
> {
> bool        exists;
> 
> exists = (stat(path, &st) == 0);
> /* avoid unused warning if it is not built with compression */
> if (exists)
> compression_spec.algorithm = PG_COMPRESSION_NONE;
> -#ifdef HAVE_LIBZ
> -   if (!exists)
> -   {
> -   free_keep_errno(fname);
> -   fname = psprintf("%s.gz", path);
> -   exists = (stat(fname, &st) == 0);
> -
> -   if (exists)
> -   compression_spec.algorithm = PG_COMPRESSION_GZIP;
> -   }
> -#endif
> -#ifdef USE_LZ4
> -   if (!exists)
> -   {
> -   free_keep_errno(fname);
> -   fname = psprintf("%s.lz4", path);
> -   exists = (stat(fname, &st) == 0);
> -
> -   if (exists)
> -   compression_spec.algorithm = PG_COMPRESSION_LZ4;
> -   }
> -#endif
> +   else if (check_compressed_file(path, , "gz"))
> +   compression_spec.algorithm = PG_COMPRESSION_GZIP;
> +   else if (check_compressed_file(path, , "lz4"))
> +   compression_spec.algorithm = PG_COMPRESSION_LZ4;
> +   else if (check_compressed_file(path, , "zst"))
> +   compression_spec.algorithm = PG_COMPRESSION_ZSTD;
> }

This function lost some coherence, I think. Should there be a hasSuffix
check at the top for ".zstd" (and, for that matter, ".lz4")? And the
comment references an unused warning, which is only possible with the
#ifdef blocks that were removed.

I'm a little suspicious of the replacement of supports_compression()
with parse_compress_specification(). For example:

> -   errmsg = supports_compression(AH->compression_spec);
> -   if (errmsg)
> +   parse_compress_specification(AH->compression_spec.algorithm,
> +   NULL, &compress_spec);
> +   if (compress_spec.parse_error != NULL)
> {
> pg_log_warning("archive is compressed, but this installation does not 
> support compression (%s
> -   errmsg);
> -   pg_free(errmsg);
> +   compress_spec.parse_error);
> +   pg_free(compress_spec.parse_error);
> }

The top-level error here is "does not support compression", but wouldn't
a bad specification option with a supported compression method trip this
path too?

> +static void
> +ZSTD_CCtx_setParam_or_die(ZSTD_CStream *cstream,
> + ZSTD_cParameter param, int value, char *paramname)

IMO we should avoid stepping on the ZSTD_ namespace with our own
internal function names.

> +   if (cs->readF != NULL)
> +   {
> +   zstdcs->dstream = ZSTD_createDStream();
> +   if (zstdcs->dstream == NULL)
> +   pg_fatal("could not initialize compression library");
> +
> +   zstdcs->input.size = ZSTD_DStreamInSize();
> +   zstdcs->input.src = pg_malloc(zstdcs->input.size);
> +
> +   zstdcs->output.size = ZSTD_DStreamOutSize();
> +   zstdcs->output.dst = pg_malloc(zstdcs->output.size + 1);
> +   }
> +
> +   if (cs->writeF != NULL)
> +   {
> +   zstdcs->cstream = ZstdCStreamParams(cs->compression_spec);
> +
> +   zstdcs->output.size = ZSTD_CStreamOutSize();
> +   zstdcs->output.dst = pg_malloc(zstdcs->output.size);
> +   zstdcs->output.pos = 0;
> +   }

This seems to suggest that 

Re: zstd compression for pg_dump

2023-03-08 Thread Jacob Champion
On Sat, Mar 4, 2023 at 8:57 AM Justin Pryzby  wrote:
> pryzbyj=# CREATE TABLE t1 AS SELECT i,array_agg(j) FROM generate_series(1,444)i,generate_series(1,9)j GROUP BY 1;
> $ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=1 |wc -c
> 82023
> $ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=0 |wc -c
> 1048267

Nice!

I did some smoke testing against zstd's GitHub release on Windows. To
build against it, I had to construct an import library, and put that
and the DLL into the `lib` folder expected by the MSVC scripts...
which makes me wonder if I've chosen a harder way than necessary?

Parallel zstd dumps seem to work as expected, in that the resulting
pg_restore output is identical to uncompressed dumps and nothing
explodes. I haven't inspected the threading implementation for safety
yet, as you mentioned. And I still wasn't able to test :workers, since
it looks like the official libzstd for Windows isn't built for
multithreading. That'll be another day's project.

--Jacob




Re: zstd compression for pg_dump

2023-03-05 Thread Justin Pryzby
On Sat, Feb 25, 2023 at 07:22:27PM -0600, Justin Pryzby wrote:
> On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> > This is a draft patch - review is welcome and would help to get this
> > ready to be considered for v16, if desired.
> > 
> > I'm going to add this thread to the old CF entry.
> > https://commitfest.postgresql.org/31/2888/
> 
> This resolves cfbot warnings: windows and cppcheck.
> And refactors zstd routines.
> And updates docs.
> And includes some fixes for earlier patches that these patches conflicts
> with/depends on.

This rebases over the TAP and doc fixes to LZ4.
And adds necessary ENV to makefile and meson.
And adds an annoying boilerplate header.
And removes supports_compression(), which is what I think Tomas meant
when referring to "annoying unsupported cases".
And updates zstd.c: fix an off-by-one, allocate in init depending on
readF/writeF, do not reset the input buffer on each iteration, and show
parameter name in errors.

I'd appreciate help checking that this is doing the right things and
works correctly with zstd threaded workers.  The zstd API says: "use one
different context per thread for parallel execution" and "For parallel
execution, use one separate ZSTD_CStream per thread".
https://github.com/facebook/zstd/blob/dev/lib/zstd.h

I understand that to mean that, if pg_dump *itself* were using threads,
then each thread would need to call ZSTD_createCStream().  pg_dump isn't
threaded, so there's nothing special needed, right?

Except that, under windows, pg_dump -Fd -j actually uses threads instead
of forking.  I *think* that's still safe, since the pgdump threads are
created *before* calling zstd functions (see _PrintTocData and
_StartData of the custom and directory formats), so it happens naturally
that there's a separate zstd stream for each thread of pgdump.
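
To make that concrete, the pattern zstd.h asks for looks roughly like this
(a sketch, not the patch; worker_run is an invented name for whatever each
dump worker executes):

#include <zstd.h>

static void
worker_run(void)
{
    /* one private stream per worker, created after the thread starts */
    ZSTD_CStream *cs = ZSTD_createCStream();

    ZSTD_CCtx_setParameter(cs, ZSTD_c_compressionLevel, ZSTD_CLEVEL_DEFAULT);
    /* ... feed data via ZSTD_compressStream2(cs, &out, &in, ZSTD_e_continue) ... */
    ZSTD_freeCStream(cs);
}

As long as each thread's _StartData/_PrintTocData path allocates its own
stream like this, no zstd state ever crosses a thread boundary.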

-- 
Justin
>From 6a8b88a3dd37d24ebfdaa6a96505476b8a1efe92 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 7 Jan 2023 15:45:06 -0600
Subject: [PATCH 1/3] pg_dump: zstd compression

Previously proposed at: 20201221194924.gi30...@telsasoft.com
---
 doc/src/sgml/ref/pg_dump.sgml |   8 +-
 src/bin/pg_dump/Makefile  |   2 +
 src/bin/pg_dump/compress_io.c |  79 ++--
 src/bin/pg_dump/compress_zstd.c   | 515 ++
 src/bin/pg_dump/compress_zstd.h   |   9 +
 src/bin/pg_dump/meson.build   |   2 +
 src/bin/pg_dump/pg_backup_archiver.c  |  28 +-
 src/bin/pg_dump/pg_backup_directory.c |   2 +
 src/bin/pg_dump/pg_dump.c |  13 -
 src/bin/pg_dump/t/002_pg_dump.pl  |  79 +++-
 src/tools/pginclude/cpluspluscheck|   1 +
 11 files changed, 654 insertions(+), 84 deletions(-)
 create mode 100644 src/bin/pg_dump/compress_zstd.c
 create mode 100644 src/bin/pg_dump/compress_zstd.h

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 334e4b7fd14..1fb66be1818 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -330,8 +330,9 @@ PostgreSQL documentation
machine-readable format that pg_restore
can read. A directory format archive can be manipulated with
standard Unix tools; for example, files in an uncompressed archive
-   can be compressed with the gzip or
-   lz4 tools.
+   can be compressed with the gzip,
+   lz4, or
+   zstd tools.
This format is compressed by default using gzip
and also supports parallel dumps.
   
@@ -655,7 +656,8 @@ PostgreSQL documentation

 Specify the compression method and/or the compression level to use.
 The compression method can be set to gzip,
-lz4, or none for no compression.
+lz4, zstd,
+or none for no compression.
 A compression detail string can optionally be specified.  If the
 detail string is an integer, it specifies the compression level.
 Otherwise, it should be a comma-separated list of items, each of the
diff --git a/src/bin/pg_dump/Makefile b/src/bin/pg_dump/Makefile
index eb8f59459a1..24de7593a6a 100644
--- a/src/bin/pg_dump/Makefile
+++ b/src/bin/pg_dump/Makefile
@@ -18,6 +18,7 @@ include $(top_builddir)/src/Makefile.global
 
 export GZIP_PROGRAM=$(GZIP)
 export LZ4
+export ZSTD
 export with_icu
 
 override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
@@ -29,6 +30,7 @@ OBJS = \
 	compress_io.o \
 	compress_lz4.o \
 	compress_none.o \
+	compress_zstd.o \
 	dumputils.o \
 	parallel.o \
 	pg_backup_archiver.o \
diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c
index ce06f1eac9c..a3c2f36bb67 100644
--- a/src/bin/pg_dump/compress_io.c
+++ b/src/bin/pg_dump/compress_io.c
@@ -52,8 +52,8 @@
  *
  *	InitDiscoverCompressFileHandle tries to infer the compression by the
  *	filename suffix. If the suffix is not yet known then it tries to simply
- *	open the file and if it fails, it tries to open the same file with the .gz
- *	suffix, and then 

Re: zstd compression for pg_dump

2023-03-04 Thread Justin Pryzby
On Fri, Mar 03, 2023 at 01:38:05PM -0800, Jacob Champion wrote:
> > > With this particular dataset, I don't see much improvement with
> > > zstd:long.
> >
> > Yeah.  I think this could be because either 1) you already got very good
> > compression without looking at more data; and/or 2) the neighboring data
> > is already very similar, maybe equally or more similar than the further
> > data, from which there's nothing to gain.
> 
> What kinds of improvements do you see with your setup? I'm wondering
> when we would suggest that people use it.

On customer data, I see small improvements - below 10%.

But on my first two tries, I made synthetic data sets where it's a lot:

$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long |wc -c
286107
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fp -Z zstd:long=0 |wc -c
1709695

That's just 6 identical tables like:
pryzbyj=# CREATE TABLE t1 AS SELECT generate_series(1,99);

In this case, "custom" format doesn't see that benefit, because the
greatest similarity is across tables, which don't share compressor
state.  But I think the note that I wrote in the docs about that should
be removed - custom format could see a big benefit, as long as the table
is big enough, and there's more similarity/repetition at longer
distances.

Here's one where custom format *does* benefit, due to long-distance
repetition within a single table.  The data is contrived, but the schema
of ID => data is not.  What's notable isn't how compressible the data
is, but how much *more* compressible it is with long-distance matching.

pryzbyj=# CREATE TABLE t1 AS SELECT i,array_agg(j) FROM generate_series(1,444)i,generate_series(1,9)j GROUP BY 1;
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=1 |wc -c
82023
$ ./src/bin/pg_dump/pg_dump -d pryzbyj -Fc -Z zstd:long=0 |wc -c
1048267
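
For context, switching long mode on through the streaming API is a
one-parameter change; a sketch (error checks elided, assuming C99 bool):

#include <stdbool.h>
#include <zstd.h>

static ZSTD_CStream *
open_cstream(int level, bool long_mode)
{
    ZSTD_CStream *cs = ZSTD_createCStream();

    ZSTD_CCtx_setParameter(cs, ZSTD_c_compressionLevel, level);
    if (long_mode)
        /* also raises the default windowLog to 27, i.e. a 128MB window */
        ZSTD_CCtx_setParameter(cs, ZSTD_c_enableLongDistanceMatching, 1);
    return cs;
}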

-- 
Justin




Re: zstd compression for pg_dump

2023-03-03 Thread Jacob Champion
On Fri, Mar 3, 2023 at 10:55 AM Justin Pryzby  wrote:
> Thanks for looking.  If your zstd library is compiled with thread
> support, could you also try with :workers=N ?  I believe this is working
> correctly, but I'm going to ask for help verifying that...

Unfortunately not (Ubuntu 20.04):

pg_dump: error: could not set compression parameter: Unsupported parameter

But that lets me review the error! I think these error messages should
say which options caused them.
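
Something along these lines would carry the option name through (a sketch
only: the helper mirrors the patch's setParameter wrapper, pg_fatal() is
the frontend logging helper from common/logging.h, and the message text is
invented):

#include <zstd.h>
#include "common/logging.h"

static void
set_compress_param(ZSTD_CStream *cs, ZSTD_cParameter param, int value,
                   const char *paramname)
{
    size_t      rc = ZSTD_CCtx_setParameter(cs, param, value);

    if (ZSTD_isError(rc))
        pg_fatal("could not set compression parameter \"%s\": %s",
                 paramname, ZSTD_getErrorName(rc));
}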

> It'd be especially useful to test under windows, where pgdump/restore
> use threads instead of forking...  If you have a windows environment but
> not set up for development, I think it's possible to get cirrusci to
> compile a patch for you and then retrieve the binaries provided as an
> "artifact" (credit/blame for this idea should be directed to Thomas
> Munro).

I should be able to do that next week.

> > With this particular dataset, I don't see much improvement with
> > zstd:long.
>
> Yeah.  I think this could be because either 1) you already got very good
> compression without looking at more data; and/or 2) the neighboring data
> is already very similar, maybe equally or more similar than the further
> data, from which there's nothing to gain.

What kinds of improvements do you see with your setup? I'm wondering
when we would suggest that people use it.

> I don't want to start exposing lots of fine-granined parameters at this
> point.  In the immediate case, it looks like it may require more than
> just adding another parameter:
>
>   Note: If windowLog is set to larger than 27,
> --long=windowLog or --memory=windowSize needs to be passed to the
> decompressor.

Hm. That would complicate things.

Thanks,
--Jacob




Re: zstd compression for pg_dump

2023-03-03 Thread Justin Pryzby
On Fri, Mar 03, 2023 at 10:32:53AM -0800, Jacob Champion wrote:
> On Sat, Feb 25, 2023 at 5:22 PM Justin Pryzby  wrote:
> > This resolves cfbot warnings: windows and cppcheck.
> > And refactors zstd routines.
> > And updates docs.
> > And includes some fixes for earlier patches that these patches conflicts
> > with/depends on.
> 
> This'll need a rebase (cfbot took a while to catch up).

Soon.

> The patchset includes basebackup modifications, which are part of a
> different CF entry; was that intended?

Yes, it's intentional - if zstd:long mode were to be merged first, then
this patch should include long mode from the start.
Or, if pgdump+zstd were merged first, then long mode could be added to
both places.

> I tried this on a local, 3.5GB, mostly-text table (from the UK Price

Thanks for looking.  If your zstd library is compiled with thread
support, could you also try with :workers=N ?  I believe this is working
correctly, but I'm going to ask for help verifying that...

It'd be especially useful to test under windows, where pgdump/restore
use threads instead of forking...  If you have a windows environment but
not set up for development, I think it's possible to get cirrusci to
compile a patch for you and then retrieve the binaries provided as an
"artifact" (credit/blame for this idea should be directed to Thomas
Munro).

> With this particular dataset, I don't see much improvement with
> zstd:long.

Yeah.  I think this could be because either 1) you already got very good
compression without looking at more data; and/or 2) the neighboring data
is already very similar, maybe equally or more similar than the further
data, from which there's nothing to gain.

> (At nearly double the CPU time, I get a <1% improvement in
> compression size.) I assume it's heavily data dependent, but from the
> notes on --long [2] it seems like they expect you to play around with
> the window size to further tailor it to your data. Does it make sense
> to provide the long option without the windowLog parameter?

I don't want to start exposing lots of fine-granined parameters at this
point.  In the immediate case, it looks like it may require more than
just adding another parameter:

  Note: If windowLog is set to larger than 27,
--long=windowLog or --memory=windowSize needs to be passed to the
decompressor.
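
On the library side that decompressor-side limit is a plain parameter as
well; roughly (a sketch, names invented):

#include <zstd.h>

static ZSTD_DStream *
open_dstream(int window_log_max)
{
    ZSTD_DStream *ds = ZSTD_createDStream();

    /* frames whose window exceeds the default cap (windowLog 27) are
     * rejected unless the cap is raised here */
    if (window_log_max > 0)
        ZSTD_DCtx_setParameter(ds, ZSTD_d_windowLogMax, window_log_max);
    return ds;
}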

-- 
Justin




Re: zstd compression for pg_dump

2023-03-03 Thread Jacob Champion
On Sat, Feb 25, 2023 at 5:22 PM Justin Pryzby  wrote:
> This resolves cfbot warnings: windows and cppcheck.
> And refactors zstd routines.
> And updates docs.
> And includes some fixes for earlier patches that these patches conflicts
> with/depends on.

This'll need a rebase (cfbot took a while to catch up). The patchset
includes basebackup modifications, which are part of a different CF
entry; was that intended?

I tried this on a local, 3.5GB, mostly-text table (from the UK Price
Paid dataset [1]) and the comparison against the other methods was
impressive. (I'm no good at constructing compression benchmarks, so
this is a super naive setup. Client's on the same laptop as the
server.)

$ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z zstd > /tmp/zstd.dump
real    1m17.632s
user    0m35.521s
sys     0m2.683s

$ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z lz4 > /tmp/lz4.dump
real    1m13.125s
user    0m19.795s
sys     0m3.370s

$ time ./src/bin/pg_dump/pg_dump -d postgres -t pp_complete -Z gzip > /tmp/gzip.dump
real    2m24.523s
user    2m22.114s
sys     0m1.848s

$ ls -l /tmp/*.dump
-rw-rw-r-- 1 jacob jacob 1331493925 Mar  3 09:45 /tmp/gzip.dump
-rw-rw-r-- 1 jacob jacob 2125998939 Mar  3 09:42 /tmp/lz4.dump
-rw-rw-r-- 1 jacob jacob 1215834718 Mar  3 09:40 /tmp/zstd.dump

Default gzip was the only method that bottlenecked on pg_dump rather
than the server, and default zstd outcompressed it at a fraction of
the CPU time. So, naively, this looks really good.

With this particular dataset, I don't see much improvement with
zstd:long. (At nearly double the CPU time, I get a <1% improvement in
compression size.) I assume it's heavily data dependent, but from the
notes on --long [2] it seems like they expect you to play around with
the window size to further tailor it to your data. Does it make sense
to provide the long option without the windowLog parameter?

Thanks,
--Jacob

[1] https://landregistry.data.gov.uk/
[2] https://github.com/facebook/zstd/releases/tag/v1.3.2




Re: zstd compression for pg_dump

2023-02-25 Thread Justin Pryzby
On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> This is a draft patch - review is welcome and would help to get this
> > ready to be considered for v16, if desired.
> 
> I'm going to add this thread to the old CF entry.
> https://commitfest.postgresql.org/31/2888/

This resolves cfbot warnings: windows and cppcheck.
And refactors zstd routines.
And updates docs.
And includes some fixes for earlier patches that these patches conflicts
with/depends on.
>From ea9b67d09fe1b51e7946cf34fca5795e57dd3858 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 24 Feb 2023 14:07:09 -0600
Subject: [PATCH 1/4] f!fixes for LZ4

---
 doc/src/sgml/ref/pg_dump.sgml| 4 ++--
 src/bin/pg_dump/compress_gzip.c  | 8 
 src/bin/pg_dump/compress_io.c| 2 +-
 src/bin/pg_dump/compress_io.h| 4 ++--
 src/bin/pg_dump/compress_lz4.c   | 4 ++--
 src/bin/pg_dump/pg_backup_archiver.c | 4 ++--
 src/bin/pg_dump/pg_dump.c| 2 --
 src/bin/pg_dump/t/002_pg_dump.pl | 6 +++---
 8 files changed, 12 insertions(+), 22 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 49d218905fb..6fbe49f7ede 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -331,7 +331,7 @@ PostgreSQL documentation
can read. A directory format archive can be manipulated with
standard Unix tools; for example, files in an uncompressed archive
can be compressed with the gzip or
-   lz4tool.
+   lz4 tools.
This format is compressed by default using gzip
and also supports parallel dumps.
   
@@ -655,7 +655,7 @@ PostgreSQL documentation

 Specify the compression method and/or the compression level to use.
 The compression method can be set to gzip or
-lz4 or none for no compression.
+lz4, or none for no compression.
 A compression detail string can optionally be specified.  If the
 detail string is an integer, it specifies the compression level.
 Otherwise, it should be a comma-separated list of items, each of the
diff --git a/src/bin/pg_dump/compress_gzip.c b/src/bin/pg_dump/compress_gzip.c
index 0af65afeb4e..52f41c2e58c 100644
--- a/src/bin/pg_dump/compress_gzip.c
+++ b/src/bin/pg_dump/compress_gzip.c
@@ -123,17 +123,9 @@ WriteDataToArchiveGzip(ArchiveHandle *AH, CompressorState *cs,
 		gzipcs->outbuf = pg_malloc(ZLIB_OUT_SIZE + 1);
 		gzipcs->outsize = ZLIB_OUT_SIZE;
 
-		/*
-		 * A level of zero simply copies the input one block at the time. This
-		 * is probably not what the user wanted when calling this interface.
-		 */
-		if (cs->compression_spec.level == 0)
-			pg_fatal("requested to compress the archive yet no level was specified");
-
 		if (deflateInit(zp, cs->compression_spec.level) != Z_OK)
 			pg_fatal("could not initialize compression library: %s", zp->msg);
 
-		/* Just be paranoid - maybe End is called after Start, with no Write */
 		zp->next_out = gzipcs->outbuf;
 		zp->avail_out = gzipcs->outsize;
 	}
diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c
index ce06f1eac9c..9239dbb2755 100644
--- a/src/bin/pg_dump/compress_io.c
+++ b/src/bin/pg_dump/compress_io.c
@@ -83,7 +83,7 @@
  * used by the caller in an error message.
  */
 char *
-supports_compression(const pg_compress_specification compression_spec)
+pgdump_supports_compression(const pg_compress_specification compression_spec)
 {
 	const pg_compress_algorithm	algorithm = compression_spec.algorithm;
 	bool		supported = false;
diff --git a/src/bin/pg_dump/compress_io.h b/src/bin/pg_dump/compress_io.h
index bbde2693915..46815fa2ebe 100644
--- a/src/bin/pg_dump/compress_io.h
+++ b/src/bin/pg_dump/compress_io.h
@@ -21,7 +21,7 @@
 #define ZLIB_OUT_SIZE	4096
 #define ZLIB_IN_SIZE	4096
 
-extern char *supports_compression(const pg_compress_specification compression_spec);
+extern char *pgdump_supports_compression(const pg_compress_specification compression_spec);
 
 /*
  * Prototype for callback function used in writeData()
@@ -172,7 +172,7 @@ struct CompressFileHandle
 extern CompressFileHandle *InitCompressFileHandle(const pg_compress_specification compression_spec);
 
 /*
- * Initialize a compress file stream. Deffer the compression algorithm
+ * Initialize a compress file stream. Infer the compression algorithm
  * from 'path', either by examining its suffix or by appending the supported
  * suffixes in 'path'.
  */
diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index fe1014e6e77..63e794cdc68 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -161,8 +161,8 @@ typedef struct LZ4File
 } LZ4File;
 
 /*
- * LZ4 equivalent to feof() or gzeof(). The end of file is reached if there
- * is no decompressed output in the overflow buffer and the end of the file
+ * LZ4 equivalent to feof() or gzeof().  Return true iff there is no
+ * decompressed 

Re: zstd compression for pg_dump

2023-02-25 Thread Euler Taveira
On Sat, Feb 25, 2023, at 7:31 AM, Tomas Vondra wrote:
> On 2/24/23 20:18, Justin Pryzby wrote:
> > This is a draft patch - review is welcome and would help to get this
> > ready to be considered for v16, if desired.
> > 
> > I'm going to add this thread to the old CF entry.
> > https://commitfest.postgresql.org/31/2888/
> > 
> 
> Thanks. Sadly cfbot is unhappy - the windows and cplusplus builds failed
> because of some issue in pg_backup_archiver.h. But it's a bit bizarre
> because the patch does not modify that file at all ...
cpluspluscheck says

# pg_dump is not C++-clean because it uses "public" and "namespace"
# as field names, which is unfortunate but we won't change it now.

Hence, the patch should exclude the new header file from it.

--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -153,6 +153,7 @@ do
test "$f" = src/bin/pg_dump/compress_gzip.h && continue
test "$f" = src/bin/pg_dump/compress_io.h && continue
test "$f" = src/bin/pg_dump/compress_lz4.h && continue
+   test "$f" = src/bin/pg_dump/compress_zstd.h && continue
test "$f" = src/bin/pg_dump/compress_none.h && continue
test "$f" = src/bin/pg_dump/parallel.h && continue
test "$f" = src/bin/pg_dump/pg_backup_archiver.h && continue


--
Euler Taveira
EDB   https://www.enterprisedb.com/


Re: zstd compression for pg_dump

2023-02-25 Thread Tomas Vondra
On 2/24/23 20:18, Justin Pryzby wrote:
> This is a draft patch - review is welcome and would help to get this
> ready to be considered for v16, if desired.
> 
> I'm going to add this thread to the old CF entry.
> https://commitfest.postgresql.org/31/2888/
> 

Thanks. Sadly cfbot is unhappy - the windows and cplusplus builds failed
because of some issue in pg_backup_archiver.h. But it's a bit bizarre
because the patch does not modify that file at all ...

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: zstd compression for pg_dump

2023-02-24 Thread Justin Pryzby
On Sat, Feb 25, 2023 at 01:44:36PM +0900, Michael Paquier wrote:
> On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> > This is a draft patch - review is welcome and would help to get this
> > ready to be considered for v16, if desired.
> > 
> > I'm going to add this thread to the old CF entry.
> > https://commitfest.postgresql.org/31/2888/
> 
> Patch 0003 adds support for the --long option of zstd, meaning that it
> "enables long distance matching with #windowLog".  What's the benefit
> of that when it is applied to dumps and base backup contents?

It (can) make it smaller.

+   The long keyword enables long-distance matching
+   mode, for improved compression ratio, at the expense of higher memory
+   use.  Long-distance mode is supported only for 

+With zstd compression, long mode may allow dumps
+to be significantly smaller, but it might not reduce the size of
+custom or directory format dumps, whose fields are separately compressed.

Note that I included that here as 003, but I also have a pre-existing
patch for adding that just to basebackup.

-- 
Justin




Re: zstd compression for pg_dump

2023-02-24 Thread Michael Paquier
On Fri, Feb 24, 2023 at 01:18:40PM -0600, Justin Pryzby wrote:
> This is a draft patch - review is welcome and would help to get this
> ready to be considered for v16, if desired.
> 
> I'm going to add this thread to the old CF entry.
> https://commitfest.postgresql.org/31/2888/

Patch 0003 adds support for the --long option of zstd, meaning that it
"enables long distance matching with #windowLog".  What's the benefit
of that when it is applied to dumps and base backup contents?
--
Michael




Re: zstd compression for pg_dump

2021-04-08 Thread David Steele

On 1/3/21 9:53 PM, Justin Pryzby wrote:


I rebased so the "typedef struct compression" patch is first and zstd on top of
that (say, in case someone wants to bikeshed about which compression algorithm
to support).  And made a central struct with all the compression-specific info
to further isolate the compress-specific changes.


It has been a few months since there was a new patch and the current one 
no longer applies, so marking Returned with Feedback.


Please resubmit to the next CF when you have a new patch.

Regards,
--
-David
da...@pgmasters.net




Re: zstd compression for pg_dump

2021-03-12 Thread Tomas Vondra



On 1/4/21 11:17 AM, Daniil Zakhlystov wrote:
> Hi!
> 
>> On Jan 4, 2021, at 11:04 AM, Andrey Borodin  wrote:
>>
>> Daniil, is levels definition compatible with libpq compression patch?
>> +typedef struct Compress {
>> +    CompressionAlgorithm    alg;
>> +    int     level;
>> +    /* Is a nondefault level set ?  This is useful since different compression
>> +     * methods have different "default" levels.  For now we assume the levels
>> +     * are all integer, though.
>> +    */
>> +    bool    level_set;
>> +} Compress;
> 
> Similarly to this patch, it is also possible to define the compression level 
> at the initialization stage in the libpq compression patch.
> 
> The difference is that in the libpq compression patch the default compression
> level is always equal to 1, independently of the chosen compression algorithm.
> 
>> On Jan 4, 2021, at 11:04 AM, Andrey Borodin  wrote:
>>
>> Libpq compression encountered some problems with memory consumption which 
>> required some extra config efforts.
> 
> 
>> On Jan 4, 2021, at 12:06 PM, Justin Pryzby  wrote:
>>
>> RAM use is not significantly different from zlib, except that zstd --long 
>> adds
>> more memory.
> 
> Regarding ZSTD memory usage:
> 
> Recently I’ve made a couple of tests of libpq compression with different 
> ZLIB/ZSTD compression levels which showed that compressing/decompressing ZSTD
> w/ high compression levels
> requires allocating more virtual (Committed_AS) memory, which may be exploited
> by malicious clients:
> 
> https://www.postgresql.org/message-id/62527092-16BD-479F-B503-FA527AF3B0C2%40yandex-team.ru
> 
> We can avoid high memory usage by limiting the max window size to 8MB. This 
> should effectively disable the support of compression levels above 19:
> https://www.postgresql.org/message-id/6A45DFAA-1682-4EF2-B835-C5F46615EC49%40yandex-team.ru
> 
> So maybe it is worthwhile to use similar restrictions in this patch.
>

I think there's a big difference between those two patches. In the libpq
case, the danger is that the client requests the server to compress the
data in a way that requires a lot of memory. I.e. the memory is consumed
on the server.

With this pg_dump patch, the compression is done by the pg_dump process,
not the server. So if the attacker configures the compression in a way
that requires a lot of memory, so what? He'll just allocate memory on
the client machine, where he could also just run a custom binary that
does a huge malloc().

So I don't think we need to worry about this too much.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: zstd compression for pg_dump

2021-01-10 Thread Andreas Karlsson

On 1/4/21 3:53 AM, Justin Pryzby wrote:

About 89% smaller.


Did a quick code review of the patch. I have not yet taken it for a spin 
yet and there are parts of the code I have not read yet.


## Is there any reason for this diff?

-cfp    *fp = pg_malloc(sizeof(cfp));
+cfp    *fp = pg_malloc0(sizeof(cfp));

## Since we now have multiple returns in cfopen() I am not sure that 
setting fp to NULL is still clearer than just returning NULL.


## I do not like that this pretends to support r+, w+ and a+ but does 
not actually do so since it does not create an input stream in those cases.


else if (mode[0] == 'w' || mode[0] == 'a' ||
strchr(mode, '+') != NULL)
[...]
else if (strchr(mode, 'r'))

## Wouldn't cfread(), cfwrite(), cfgetc(), cfgets(), cfclose() and 
cfeof() be cleaner with switch statements similar to cfopen()?


## "/* Should be called "method" or "library" ? */"

Maybe, but personally I think algorithm is fine too.

## "Is a nondefault level set ?"

The PostgreSQL project does not use space before question mark (at least 
not in English).


## Why isn't level_set just a local variable in parse_compression()? It 
does not seem to be used elsewhere.


## Shouldn't we call the Compression variable in OpenArchive() 
nocompress to match with the naming convention in other places.


And in general I wonder if we should not write "nocompression = 
{COMPR_ALG_NONE}" rather than "nocompression = {0}".


## Why not use const on the pointers to Compression for functions like 
cfopen()? As far as I can see several of them could be const.


## Shouldn't "AH->compression.alg = Z_DEFAULT_COMPRESSION" in ReadHead() 
be "AH->compression.alg = COMPR_ALG_DEFAULT"?


Additionally I am not convinced that returning COMPR_ALG_DEFAULT will 
even work but I have not had the time to test that theory yet. And in 
general I am quite sceptical that we really need COMPR_ALG_DEFAULT.


## Some white space issues

Add spaces around plus in "atoi(1+eq)" and "pg_log_error("unknown 
compression algorithm: %s", 1+eq)".


Add spaces around plus in parse_compression(), e.g. in "strlen(1+eq)".

## Shouldn't hasSuffix() take the current compression algorithm as a 
parameter? Or alternatively look up which compression algorithm to use 
from the suffix?


## Why support multiple ways to write zlib on the command line? I do not 
see any advantage of being able to write it as libz.


## I feel renaming SaveOutput() to GetOutput() would make it more clear 
what it does now that you have changed the return type.


## You have accidentally committed "-runstatedir" in configure. I have 
no idea why we do not have it (maybe it is something Debian specific) 
but even if we are going to add it, it should not be in this patch. Same 
with the parenthesis changes to LARGE_OFF_T.


## This is probably out of scope of your patch but I am not a fan of the 
fallback logic in cfopen_read(). I feel ideally we should always know if 
there is a suffix or not and not try to guess file names and do 
pointless syscalls.


## COMPR_ALG_DEFAULT looks like it would error out for archive and 
directory if someone has neither zlib nor zstandard. It feels like it 
should default to uncompressed if we have neither. Or at least give a 
better error message.



Note, there's currently several "compression" patches in CF app.  This patch
seems to be independent of the others, but probably shouldn't be totally
uncoordinated (like adding lz4 in one and ztsd in another might be poor
execution).


A thought here is that maybe we want to use the same values for the 
enums in all patches. Especially if we write the numeric value to pg 
dump files.
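
For illustration, pinning the values would look like this (hypothetical
names, not taken from any of the patches):

typedef enum CompressionAlgorithm
{
    COMPR_ALG_NONE = 0,         /* once a value has been written into a   */
    COMPR_ALG_LIBZ = 1,         /* dump file it must never be renumbered; */
    COMPR_ALG_LZ4 = 2,          /* new algorithms only get appended       */
    COMPR_ALG_ZSTD = 3
} CompressionAlgorithm;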


Andreas





Re: zstd compression for pg_dump

2021-01-04 Thread Daniil Zakhlystov
Hi!

> On Jan 4, 2021, at 11:04 AM, Andrey Borodin  wrote:
> 
> Daniil, is levels definition compatible with libpq compression patch?
> +typedef struct Compress {
> +    CompressionAlgorithm    alg;
> +    int     level;
> +    /* Is a nondefault level set ?  This is useful since different compression
> +     * methods have different "default" levels.  For now we assume the levels
> +     * are all integer, though.
> +    */
> +    bool    level_set;
> +} Compress;

Similarly to this patch, it is also possible to define the compression level at 
the initialization stage in the libpq compression patch.

The difference is that in the libpq compression patch the default compression
level is always equal to 1, independently of the chosen compression algorithm.

> On Jan 4, 2021, at 11:04 AM, Andrey Borodin  wrote:
> 
> Libpq compression encountered some problems with memory consumption which 
> required some extra config efforts.


> On Jan 4, 2021, at 12:06 PM, Justin Pryzby  wrote:
> 
> RAM use is not significantly different from zlib, except that zstd --long adds
> more memory.

Regarding ZSTD memory usage:

Recently I’ve made a couple of tests of libpq compression with different 
ZLIB/ZSTD compression levels which showed that compressing/decompressing ZSTD w/
high compression levels
requires allocating more virtual (Committed_AS) memory, which may be exploited
by malicious clients:

https://www.postgresql.org/message-id/62527092-16BD-479F-B503-FA527AF3B0C2%40yandex-team.ru

We can avoid high memory usage by limiting the max window size to 8MB. This 
should effectively disable the support of compression levels above 19:
https://www.postgresql.org/message-id/6A45DFAA-1682-4EF2-B835-C5F46615EC49%40yandex-team.ru

So maybe it is worthwhile to use similar restrictions in this patch.

—
Daniil Zakhlystov



Re: zstd compression for pg_dump

2021-01-03 Thread Justin Pryzby
On Mon, Jan 04, 2021 at 11:04:57AM +0500, Andrey Borodin wrote:
> > On Jan 4, 2021, at 07:53, Justin Pryzby wrote:
> > Note, there's currently several "compression" patches in CF app.  This patch
> > seems to be independent of the others, but probably shouldn't be totally
> > uncoordinated (like adding lz4 in one and zstd in another might be poor
> > execution).
> > 
> > https://commitfest.postgresql.org/31/2897/
> > - Faster pglz compression
> > https://commitfest.postgresql.org/31/2813/
> > - custom compression methods for toast
> > https://commitfest.postgresql.org/31/2773/
> > - libpq compression
> 
> I think that's a downside of our development system: patch authors do not want 
> to create dependencies on other patches.

I think in these cases, someone who notices common/overlapping patches should
suggest that the authors review each other's work.  In some cases, I think it's
appropriate to come up with a "shared" preliminary patch(es), which both (all)
patch authors can include as 0001 until its finalized and merged.  That might
be true for some things like the tableam work, or the two "online checksum"
patches.

> I'd say that both lz4 and zstd should be supported in TOAST, FPIs, libpq, and 
> pg_dump. As to pglz - I think we should not proliferate it any further.

pg_basebackup came up as another use on another thread, I think related to
libpq protocol compression.

> Libpq compression encountered some problems with memory consumption which
> required some extra config efforts. Did you measure memory usage for this
> patchset?

RAM use is not significantly different from zlib, except that zstd --long adds
more memory.

$ command time -v pg_dump -d ts -t ... -Fc -Z0 |wc -c
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:28.77
Maximum resident set size (kbytes): 40504
1397288924 # no compression: 1400MB

$ command time -v pg_dump -d ts -t ... -Fc |wc -c
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:37.17
Maximum resident set size (kbytes): 40504
132932415 # default (zlib) compression: 132 MB

$ command time -v ./pg_dump -d ts -t ... -Fc |wc -c
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:29.28
Maximum resident set size (kbytes): 40568
86048139 # zstd: 86MB

$ command time -v ./pg_dump -d ts -t ... -Fc -Z 'alg=zstd opt=zstdlong' |wc -c
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:30.49
Maximum resident set size (kbytes): 180332
72202937 # zstd long: 180MB

> [PATCH 04/20] struct compressLibs
> I think this directive would be correct.
> +// #ifdef HAVE_LIBZ?

I'm not sure .. I'm thinking of making the COMPR_ALG_* always defined, and then
fail later if an operation is unsupported.  There's an excessive number of
#ifdefs already, so the early commits are intended to minimize as far as
possible what's needed for each additional compression
algorithm (lib/method/whatever it's called).  I haven't tested much with
pg_restore of files with unsupported compression libs.

> [PATCH 06/20] pg_dump: zstd compression
> I'd propose to build with Zstd by default. It seems other patches do it this 
> way. Though, there are possible downsides.

Yes...but the cfbot turns red if the patch requires zstd, so it defaults to
off until it's included in the build environments (but for now, the main patch
isn't being tested).

Thanks for looking.

-- 
Justin




Re: zstd compression for pg_dump

2021-01-03 Thread Andrey Borodin



> On Jan 4, 2021, at 07:53, Justin Pryzby wrote:
> 
> Note, there's currently several "compression" patches in CF app.  This patch
> seems to be independent of the others, but probably shouldn't be totally
> uncoordinated (like adding lz4 in one and zstd in another might be poor
> execution).
> 
> https://commitfest.postgresql.org/31/2897/
> - Faster pglz compression
> https://commitfest.postgresql.org/31/2813/
> - custom compression methods for toast
> https://commitfest.postgresql.org/31/2773/
> - libpq compression

I think that's a downside of our development system: patch authors do not want to 
create dependencies on other patches.
I'd say that both lz4 and zstd should be supported in TOAST, FPIs, libpq, and 
pg_dump. As to pglz - I think we should not proliferate it any further.
Lz4 and Zstd actually represent different tradeoffs. Basically, lz4 is so
CPU-cheap that one should use it whenever they write to a disk or network
interface. Zstd represents an actual bandwidth/CPU tradeoff.
Also, none of the patchsets touch an important possibility - a preexisting
dictionary could radically improve compression of small data (even in pglz).
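
For reference, libzstd already exposes the dictionary machinery; a rough
sketch of the idea (the zstd.h/zdict.h calls are real, everything else is
invented, and error checks are elided):

#include <zstd.h>
#include <zdict.h>

static size_t
compress_with_dict(void *dst, size_t dstcap, const void *src, size_t srclen,
                   const void *samples, const size_t *sample_sizes,
                   unsigned nsamples)
{
    char        dict[16 * 1024];
    /* train a small dictionary from representative sample values */
    size_t      dictlen = ZDICT_trainFromBuffer(dict, sizeof(dict), samples,
                                                sample_sizes, nsamples);
    ZSTD_CCtx  *cctx = ZSTD_createCCtx();
    size_t      ret = ZSTD_compress_usingDict(cctx, dst, dstcap, src, srclen,
                                              dict, dictlen,
                                              ZSTD_CLEVEL_DEFAULT);

    ZSTD_freeCCtx(cctx);
    return ret;                 /* check with ZSTD_isError() in real code */
}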

Some minor notes on patchset at this thread.

Libpq compression encountered some problems with memory consumption which 
required some extra config efforts. Did you measure memory usage for this 
patchset?

[PATCH 03/20] Support multiple compression algs/levels/opts..
abtracts -> abstracts
enum CompressionAlgorithm actually represents the very same thing as in "Custom 
compression methods"

Daniil, is levels definition compatible with libpq compression patch?
+typedef struct Compress {
+    CompressionAlgorithm    alg;
+    int     level;
+    /* Is a nondefault level set ?  This is useful since different compression
+     * methods have different "default" levels.  For now we assume the levels
+     * are all integer, though.
+    */
+    bool    level_set;
+} Compress;

[PATCH 04/20] struct compressLibs
I think this directive would be correct.
+// #ifdef HAVE_LIBZ?

Here's an extra comment
// && errno == ENOENT)


[PATCH 06/20] pg_dump: zstd compression

I'd propose to build with Zstd by default. It seems other patches do it this 
way. Though, there are possible downsides.


Thanks for working on this! We will have a very IO-efficient Postgres :)

Best regards, Andrey Borodin.



Re: zstd compression for pg_dump

2021-01-03 Thread Justin Pryzby
On Mon, Dec 21, 2020 at 01:49:24PM -0600, Justin Pryzby wrote:
> a big disadvantage of piping through zstd is that it's not identified as a
> PGDMP file, and, /usr/bin/file on centos7 fails to even identify zstd by its
> magic number..

Other reasons are that pg_dump |zstd >output.zst loses the exit status of
pg_dump, and that it's not "transparent" (one needs to type
"zstd -dq |pg_restore").

On Mon, Dec 21, 2020 at 08:32:35PM -0600, Justin Pryzby wrote:
> On Mon, Dec 21, 2020 at 03:02:40PM -0500, Tom Lane wrote:
> > Justin Pryzby  writes:
> > > I found that our largest tables are 40% smaller and 20% faster to pipe
> > > pg_dump -Fc -Z0 |zstd relative to native zlib
> > 
> > The patch might be a tad smaller if you hadn't included a core file in it.
> 
> About 89% smaller.
> 
> This also fixes the extension (.zst)
> And fixes zlib default compression.
> And a bunch of cleanup.

I rebased so the "typedef struct compression" patch is first and zstd on top of
that (say, in case someone wants to bikeshed about which compression algorithm
to support).  And made a central struct with all the compression-specific info
to further isolate the compress-specific changes.

And handle compression of "plain" archive format.
And fix compilation for MSVC and make --without-zstd the default.

And fix cfgets() (which I think is actually unused code for the code paths for
compressed FP).

And add fix for pre-existing problem: ftello() on unseekable input.

I also started a patch to allow compression of "tar" format, but I didn't
include that here yet.

Note, there's currently several "compression" patches in CF app.  This patch
seems to be independent of the others, but probably shouldn't be totally
uncoordinated (like adding lz4 in one and ztsd in another might be poor
execution).

https://commitfest.postgresql.org/31/2897/
 - Faster pglz compression
https://commitfest.postgresql.org/31/2813/
 - custom compression methods for toast
https://commitfest.postgresql.org/31/2773/
 - libpq compression

-- 
Justin
>From d2fc2673e19a95629edfe9a0f4ead75e1f1f2754 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 23 Dec 2020 23:56:54 -0600
Subject: [PATCH 01/20] fix!preeexisting

---
 src/bin/pg_dump/compress_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/pg_dump/compress_io.c b/src/bin/pg_dump/compress_io.c
index 1417401086..6a428978d4 100644
--- a/src/bin/pg_dump/compress_io.c
+++ b/src/bin/pg_dump/compress_io.c
@@ -28,7 +28,7 @@
  *	The interface for reading an archive consists of just one function:
  *	ReadDataFromArchive. ReadDataFromArchive reads the whole compressed input
  *	stream, by repeatedly calling the given ReadFunc. ReadFunc returns the
- *	compressed data chunk at a time, and ReadDataFromArchive decompresses it
+ *	compressed data one chunk at a time, and ReadDataFromArchive decompresses it
  *	and passes the decompressed data to ahwrite(), until ReadFunc returns 0
  *	to signal EOF.
  *
-- 
2.17.0

>From e8bb61dd633aefb2cc7a14887a15cc60e05cd9c5 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 22 Dec 2020 15:40:08 -0600
Subject: [PATCH 02/20] Fix broken error message on unseekable input..

pg_dump -Fd -f dir.dir
cat dir.dir/toc.dat | pg_restore -l # I realize this isn't how it's intended to be used
pg_restore: error: corrupt tar header found in PGDMP (expected 0, computed 18577) file position 18446744073709551615

See also 929c69aa19
---
 src/bin/pg_dump/pg_backup_tar.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c
index 54e708875c..61c9c87a9f 100644
--- a/src/bin/pg_dump/pg_backup_tar.c
+++ b/src/bin/pg_dump/pg_backup_tar.c
@@ -1280,12 +1280,14 @@ _tarGetHeader(ArchiveHandle *AH, TAR_MEMBER *th)
 
 	if (chk != sum)
 	{
-		char		posbuf[32];
+		off_t		off = ftello(ctx->tarFH);
 
-		snprintf(posbuf, sizeof(posbuf), UINT64_FORMAT,
- (uint64) ftello(ctx->tarFH));
-		fatal("corrupt tar header found in %s (expected %d, computed %d) file position %s",
-			  tag, sum, chk, posbuf);
+		if (off == -1)
+			fatal("corrupt tar header found in %s (expected %d, computed %d)",
+  tag, sum, chk);
+		else
+			fatal("corrupt tar header found in %s (expected %d, computed %d) file position " UINT64_FORMAT,
+  tag, sum, chk, off);
 	}
 
 	th->targetFile = pg_strdup(tag);
-- 
2.17.0

>From bbf91d13a25bcb2f0718f40984362f81741ca66a Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 22 Dec 2020 00:23:43 -0600
Subject: [PATCH 03/20] Support multiple compression algs/levels/opts..

The existing implementation abtracts compressed and noncompressed I/O.
This preliminary commit is intended to also allow for multiple compression
algorithms.
---
 src/bin/pg_dump/compress_io.c | 220 +++---
 src/bin/pg_dump/compress_io.h |  19 +--
 src/bin/pg_dump/pg_backup.h   |  23 ++-
 src/bin/pg_dump/pg_backup_archiver.c  |  45 +++---
 

Re: zstd compression for pg_dump

2020-12-21 Thread Justin Pryzby
On Mon, Dec 21, 2020 at 03:02:40PM -0500, Tom Lane wrote:
> Justin Pryzby  writes:
> > I found that our largest tables are 40% smaller and 20% faster to pipe
> > pg_dump -Fc -Z0 |zstd relative to native zlib
> 
> The patch might be a tad smaller if you hadn't included a core file in it.

About 89% smaller.

This also fixes the extension (.zst)
And fixes zlib default compression.
And a bunch of cleanup.

-- 
Justin
>From c506c8a93ae70726a5b4b33a1d0098caf5665f3a Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Mon, 21 Dec 2020 00:32:32 -0600
Subject: [PATCH 1/7] fix pre-existing docs/comments

---
 doc/src/sgml/ref/pg_dump.sgml | 2 +-
 src/bin/pg_dump/pg_backup_directory.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0aa35cf0c3..dcb25dc3cd 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -621,7 +621,7 @@ PostgreSQL documentation
   

 Specify the compression level to use.  Zero means no compression.
-For the custom archive format, this specifies compression of
+For the custom and directory archive formats, this specifies compression of
 individual table-data segments, and the default is to compress
 at a moderate level.
 For plain text output, setting a nonzero compression level causes
diff --git a/src/bin/pg_dump/pg_backup_directory.c b/src/bin/pg_dump/pg_backup_directory.c
index 48fa7cb1a3..650b542fce 100644
--- a/src/bin/pg_dump/pg_backup_directory.c
+++ b/src/bin/pg_dump/pg_backup_directory.c
@@ -4,7 +4,7 @@
  *
  *	A directory format dump is a directory, which contains a "toc.dat" file
  *	for the TOC, and a separate file for each data entry, named ".dat".
- *	Large objects (BLOBs) are stored in separate files named "blob_.dat",
+ *	Large objects (BLOBs) are stored in separate files named "blob_.dat",
  *	and there's a plain-text TOC file for them called "blobs.toc". If
  *	compression is used, each data file is individually compressed and the
  *	".gz" suffix is added to the filenames. The TOC files are never
-- 
2.17.0

>From 49f400b5b2e559cc2ed6be82e04a4dc4ba5c63b3 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 12 Dec 2020 00:11:37 -0600
Subject: [PATCH 2/7] Fix malformed comment..

broken since bf50caf10.
---
 src/bin/pg_dump/pg_backup_archiver.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_dump/pg_backup_archiver.h b/src/bin/pg_dump/pg_backup_archiver.h
index 177360ed6e..6a5a22637b 100644
--- a/src/bin/pg_dump/pg_backup_archiver.h
+++ b/src/bin/pg_dump/pg_backup_archiver.h
@@ -329,9 +329,12 @@ struct _archiveHandle
 	DumpId	   *tableDataId;	/* TABLE DATA ids, indexed by table dumpId */
 
 	struct _tocEntry *currToc;	/* Used when dumping data */
-	int			compression;	/* Compression requested on open Possible
- * values for compression: -1
- * Z_DEFAULT_COMPRESSION 0	COMPRESSION_NONE
+	int			compression;	/*-
+ * Compression requested on open
+ * Possible values for compression:
+ * -2	ZSTD_COMPRESSION
+ * -1	Z_DEFAULT_COMPRESSION
+ *  0	COMPRESSION_NONE
  * 1-9 levels for gzip compression */
 	bool		dosync;			/* data requested to be synced on sight */
 	ArchiveMode mode;			/* File mode - r or w */
-- 
2.17.0
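
Using the values from the repaired comment, a sketch of how a reader of
AH->compression might decode the overloaded integer (compression_name()
is an invented helper, not in the patch):

#define ZSTD_COMPRESSION	(-2)
#define COMPRESSION_NONE	0	/* zlib.h supplies Z_DEFAULT_COMPRESSION (-1) */

const char *
compression_name(int compression)
{
	if (compression == ZSTD_COMPRESSION)
		return "zstd";
	if (compression == -1)		/* Z_DEFAULT_COMPRESSION */
		return "gzip, default level";
	if (compression == COMPRESSION_NONE)
		return "none";
	if (compression >= 1 && compression <= 9)
		return "gzip";
	return "unrecognized";
}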

>From 8a8939a3060c984af289b5fe62754d94f2675248 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Fri, 11 Dec 2020 15:01:25 -0600
Subject: [PATCH 3/7] pg_dump: zstd compression

document any change in search for .gz?
docs
---
 configure | 123 ++-
 configure.ac  |  22 ++
 src/bin/pg_dump/compress_io.c | 475 --
 src/bin/pg_dump/compress_io.h |   5 +-
 src/bin/pg_dump/pg_backup_archiver.h  |   5 +
 src/bin/pg_dump/pg_backup_directory.c |   6 +-
 src/bin/pg_dump/pg_dump.c |   3 +-
 src/include/pg_config.h.in|   3 +
 8 files changed, 597 insertions(+), 45 deletions(-)

diff --git a/configure b/configure
index 11a4284e5b..240e536e04 100755
--- a/configure
+++ b/configure
@@ -698,6 +698,7 @@ with_gnu_ld
 LD
 LDFLAGS_SL
 LDFLAGS_EX
+with_zstd
 with_zlib
 with_system_tzdata
 with_libxslt
@@ -798,6 +799,7 @@ infodir
 docdir
 oldincludedir
 includedir
+runstatedir
 localstatedir
 sharedstatedir
 sysconfdir
@@ -866,6 +868,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_zstd
 with_gnu_ld
 enable_largefile
 '
@@ -935,6 +938,7 @@ datadir='${datarootdir}'
 sysconfdir='${prefix}/etc'
 sharedstatedir='${prefix}/com'
 localstatedir='${prefix}/var'
+runstatedir='${localstatedir}/run'
 includedir='${prefix}/include'
 oldincludedir='/usr/include'
 docdir='${datarootdir}/doc/${PACKAGE_TARNAME}'
@@ -1187,6 +1191,15 @@ do
   | -silent | --silent | --silen | --sile | --sil)
 silent=yes ;;
 
+  -runstatedir | --runstatedir | --runstatedi | --runstated \
+  | 

Re: zstd compression for pg_dump

2020-12-21 Thread Tom Lane
Justin Pryzby  writes:
> I found that our largest tables are 40% smaller and 20% faster to pipe
> pg_dump -Fc -Z0 |zstd relative to native zlib

The patch might be a tad smaller if you hadn't included a core file in it.

regards, tom lane