On 20.01.2011 15:46, Joachim Wieland wrote:
> On Thu, Jan 20, 2011 at 6:07 AM, Heikki Linnakangas
> <heikki.linnakan...@enterprisedb.com> wrote:
> The header is there to identify a file; it contains the same header that
> every other pg_dump file contains, including the internal version
> number and the unique backup id.

> The tar format doesn't support compression, so going from one to the
> other would only work for an uncompressed archive, and special care
> must be taken to get the order of the tar file right.

Hmm, the tar format doesn't support compression, but it looks like the
file format issue has been thought of already: there's still code there
to add a .gz suffix for compressed files. How about adopting that
convention in the directory format too? That would make an uncompressed
directory format dump compatible with the tar format.
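To illustrate, the directory might then look something like this (the
file names and entry numbers are invented for illustration):

    dumpdir/
        toc.dat        table of contents, always uncompressed
        2308.dat.gz    per-table data, .gz suffix only when compressed
        2309.dat.gz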

> So what you could do is dump in the tar format, untar, and restore in
> the directory format. I see that this sounds nice, but I am still not
> sure why someone would dump to the tar format in the first place.

I'm not sure either. Maybe you want to pipe the output of "pg_dump -F t" through an ssh tunnel to another host and untar it there, producing a directory format dump. You can then edit that dump and restore it back into the database without having to tar it up again.
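Something along these lines, say (just a sketch; the host and database
names are made up, and it assumes the two formats really are kept
compatible):

    pg_dump -F t mydb | ssh user@edithost 'mkdir dump && tar xf - -C dump'
    ... edit the files under dump/ on edithost ...
    pg_restore -F d -h dbhost -d mydb dump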

It gives you a lot of flexibility if the formats are compatible, which is generally good.

> But you still cannot go back from the directory archive to the tar
> archive, because the standard command line tar will not respect the
> order of the objects that pg_restore expects in a tar archive, right?

Hmm, I didn't realize pg_restore requires the files to be in a certain order in the tar file. There's no mention of that in the docs either; we should add that. It doesn't actually require a particular order if you read from a file, but from stdin it does.

You can put the files into the archive in a certain order if you list them explicitly on the tar command line, like "tar cf backup.tar toc.dat ...". It's hard to know the right order, though. In practice you would need to run "tar tf backup.tar > files" before untarring, and use "files" to tar them up again in the right order.
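Roughly like this ("-T" is GNU tar's take-the-file-names-from-a-file
option; a sketch, not tested):

    tar tf backup.tar > files      # record the original member order
    tar xf backup.tar              # unpack and edit the files ...
    tar cf backup2.tar -T files    # ... then repack in the recorded order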

>> That seems pretty attractive anyway, because you can then dump to a
>> directory, and manually gzip the data files later.

> The command line gzip will probably add its own header to the file,
> which pg_restore would need to strip off...

Yeah, we should write the gzip header too. That's not hard: e.g. gzopen will do it automatically, or you can pass a flag to deflateInit2.
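For the deflateInit2 route, it's the windowBits parameter that selects
the wrapper: adding 16 to it makes zlib emit a gzip header and trailer
instead of the default zlib one. A minimal sketch (not actual pg_dump
code; the file name is made up):

    #include <stdio.h>
    #include <zlib.h>

    int main(void)
    {
        /* Route 1: deflate stream with a gzip wrapper.
         * windowBits = 15 (max window) + 16 selects a gzip
         * header/trailer instead of the default zlib wrapper. */
        z_stream zs = {0};
        if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                         15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
            return 1;
        deflateEnd(&zs);

        /* Route 2: the gz* convenience API writes the header for us. */
        gzFile fp = gzopen("1234.dat.gz", "wb");
        if (fp != NULL)
        {
            gzputs(fp, "some table data\n");
            gzclose(fp);
        }
        return 0;
    }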

> A tar archive has the advantage that you can postprocess the dump data
> with other tools, but for this we could also add an option that gives
> you only the data part of a dump file (and uncompresses it at the same
> time if it is compressed). Once we have that, however, the question is
> what anybody would then still want to use the tar format for...

>> I don't know how popular it'll be in practice, but it seems very nice
>> to me if you can do things like parallel pg_dump in directory format
>> first, and then tar it up to a file for archival.

> Yes, but then you cannot pg_restore the archive if it was created with
> standard tar, right?

See above: you can, unless you try to pipe it to pg_restore. In fact, that's listed as an advantage of the tar format over the other formats in the pg_dump documentation.
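In other words, something like this should work (a sketch, assuming an
uncompressed dump and that the member order comes out right; "newdb" is
made up):

    pg_dump -F d -f dump mydb
    (cd dump && tar cf ../backup.tar toc.dat [0-9]*.dat)  # toc.dat first
    pg_restore -d newdb backup.tar      # fine: pg_restore can seek in a file
    pg_restore -d newdb < backup.tar    # only works if the order is right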

(I'm working on this, no need to submit a new patch)

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
