Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Stephen Frost
* Joshua D. Drake ([EMAIL PROTECTED]) wrote:
> Custom format rocks for partial set restores from a whole dump. See the  
> TOC option :)

I imagine it does, but that's very rarely what I need.  Most of the time
we're dumping out a schema to load it into a separate schema (usually on
another host).  Sometimes that can be done by simply vi'ing the file to
change the search_path and whatnot, though more often we end up piping
the whole thing through sed.  Since we don't allow regular users to do
much, and you have to 'set role postgres;' to do anything as superuser,
we also often end up adding 'set role postgres;' to the top of the .sql
files.
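
For example, something along these lines (schema names hypothetical, and
a real dump usually needs more than one substitution):

    pg_dump -n prod_schema mydb \
      | sed -e 's/^SET search_path = prod_schema/SET search_path = qa_schema/' \
      > dump.sql
    { echo 'set role postgres;'; cat dump.sql; } | psql -h otherhost mydb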

Thanks,

Stephen




Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Joshua D. Drake

Stephen Frost wrote:
> * Tom Lane ([EMAIL PROTECTED]) wrote:
> > Stephen Frost <[EMAIL PROTECTED]> writes:
> > > I dislike, and doubt that I'd use, this approach.  At the end of the
> > > day, it ends up processing the same (very large amount of data) multiple
> > > times.
>
> This would depend on the dump being in the custom format, though I
> suppose that ends up being true for any usage of these options.  I've
> never really been a fan of the custom format, in large part because it
> doesn't really buy you all that much and makes changing things more
> difficult (by having to extract out what you want to change, and then
> omit it from the restore).


Custom format rocks for partial set restores from a whole dump. See the 
TOC option :)
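
For instance, roughly:

    pg_restore -l whole.dump > whole.toc     # write out the table of contents
    (edit whole.toc, commenting out entries with ';')
    pg_restore -L whole.toc -d target_db whole.dump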


Joshua D. Drake



Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Stephen Frost
* Tom Lane ([EMAIL PROTECTED]) wrote:
> Right, but the parallelization is going to happen sometime, and it is
> going to happen in the context of pg_restore.  So I think it's pretty
> silly to argue that no one will ever want this feature to work in
> pg_restore.

I think you've about convinced me on this, and it annoys me. ;)  Worse
is that it sounds like this might cause the options to not make it in
for 8.4, which would be quite frustrating.

> To extend the example I just gave to Stephen, I think a fairly probable
> scenario is where you only need to tweak some "before" object
> definitions, and then you could do
> 
> pg_restore --schema-before-data whole.dump >before.sql
> edit before.sql
> psql -f before.sql target_db
> pg_restore --data-only --schema-after-data -d target_db whole.dump
> 
> which (given a parallelizing pg_restore) would do all the time-consuming
> steps in a fully parallelized fashion.

Alright, this has been mulling around in the back of my head a bit and
has now finally surfaced: I like having the whole dump contained in a
single file, but I hate having what ends up being "out-dated" or "wrong"
or "not what was loaded" in the dump file.  It doesn't seem likely to be
possible, but it'd be neat to be able to modify objects in the dump
file.

Also, something which often happens to me is that I need to change the
search_path or the role at the top of a .sql file from pg_dump before
restoring it.  It seems like using the custom format would make that
difficult without some pipe/cat/sed magic, and parallelization would
make that kind of magic harder still, I would guess.  Might be
something to think about.
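
Something like this is what I end up reaching for (names hypothetical),
since pg_restore writes a plain SQL script to stdout when no -d is given:

    { echo 'set role postgres;'; \
      pg_restore whole.dump \
        | sed -e 's/^SET search_path = prod/SET search_path = qa/'; } \
      | psql target_db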

Thanks,

Stephen




Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Stephen Frost
* Tom Lane ([EMAIL PROTECTED]) wrote:
> Stephen Frost <[EMAIL PROTECTED]> writes:
> > I dislike, and doubt that I'd use, this approach.  At the end of the
> > day, it ends up processing the same (very large amount of data) multiple
> > times.
> 
> Well, that's easily avoided: just replace the third step by restoring
> directly to the target database.
> 
> pg_restore --schema-before-data whole.dump >before.sql
> edit before.sql
> pg_restore --schema-after-data whole.dump >after.sql
> edit after.sql
> psql -f before.sql target_db
> pg_restore --data-only -d target_db whole.dump
> psql -f after.sql target_db

This would depend on the dump being in the custom format, though I
suppose that ends up being true for any usage of these options.  I've
never really been a fan of the custom format, in large part because it
doesn't really buy you all that much and makes changing things more
difficult (by having to extract out what you want to change, and then
omit it from the restore).

I can see some advantage to having the entire dump contained in a single
file and still being able to pull out pieces based on before/after.
Should we get a binary format which is much faster, I could see myself
being more likely to use pg_restore.  Same for parallelization or, in my
fantasies, the ability to copy schema, tables, indexes, etc, in 'raw' PG
format between servers.  Worse than having to vi an insanely large file,
or split it up to be able to modify the pieces you want, is having to
rebuild indexes, especially GiST ones.  That's another topic though.

Thanks,

Stephen




Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Joshua D. Drake

Andrew Dunstan wrote:
> Joshua D. Drake wrote:
> > Agreed, but that is a problem I understand with a solution I don't.  I
> > am all eyes for a way to fix it.  One thought I had (and please, be
> > gentle in response) was some sort of async transaction capability.  I
> > know that libpq has the ability to send async queries.  Is it possible
> > to do this:
> >
> > send async(copy table to foo)
> > send async(copy table to bar)
> > send async(copy table to baz)
> >
> > where all three copies are happening in the background?
>
> IIRC, libpq doesn't let you have more than one async query active at one
> time.

Now that I think on it harder, this isn't even a libpq problem (although
it's involved): we need the postmaster to be able to run a background
async query, which is (I am guessing) why libpq can only do one at a time.


Sincerely,

Joshua D. Drake





Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Andrew Dunstan



Joshua D. Drake wrote:
> Agreed, but that is a problem I understand with a solution I don't.  I
> am all eyes for a way to fix it.  One thought I had (and please, be
> gentle in response) was some sort of async transaction capability.  I
> know that libpq has the ability to send async queries.  Is it possible
> to do this:
>
> send async(copy table to foo)
> send async(copy table to bar)
> send async(copy table to baz)
>
> where all three copies are happening in the background?




IIRC, libpq doesn't let you have more than one async query active at one 
time.


cheers

andrew



Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Simon Riggs

On Sat, 2008-07-26 at 11:03 -0700, Joshua D. Drake wrote:

> 2. We have no concurrency, which means anyone with any database over 50G
> has unacceptable restore times.

Agreed.

That's also the core reason for wanting -w.

> 3. We have to continue to develop hacks to define custom utilization.  Why
> am I passing pre-data anything?  It should be automatic.  For example:
>
> pg_backup (not dump, we aren't dumping.  Dumping is usually associated
> with some sort of crash or foul human behavior.  We are backing up).
>
>    pg_backup -U  -D database -F -f mybackup.sqlc
>
> If I were to extract it, I would get:
>
>   mybackup.datatypes
>   mybackup.tables
>   mybackup.data
>   mybackup.primary_keys
>   mybackup.indexes
>   mybackup.constraints
>   mybackup.grants

Sounds good.

It doesn't help with the main element of dump time, though: one table at a
time to one output file.  We need a way to dump multiple tables
concurrently, ending up in multiple files/filesystems.

> Oh and pg_dumpall? It should have been removed right around the release
> of 7.2, pg_dump -A please.

Good idea.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Simon Riggs

On Sat, 2008-07-26 at 13:56 -0400, Tom Lane wrote:
> Simon Riggs <[EMAIL PROTECTED]> writes:
> > I want to dump tables separately for performance reasons. There are
> > documented tests showing 100% gains using this method. There is no gain
> > adding this to pg_restore. There is a gain to be had - parallelising
> > index creation, but this patch doesn't provide parallelisation.
> 
> Right, but the parallelization is going to happen sometime, and it is
> going to happen in the context of pg_restore. 

I honestly think there is less benefit that way than if we consider
things more as a whole:

To do data dump quickly we need to dump different tables to different
disks simultaneously. By its very nature, that cannot end with just a
single file. So the starting point for any restore must be potentially
more than one file.

There are two ways of dumping: either multi-thread pg_dump, or allow
multiple pg_dumps to work together.  The second option is much less work,
with the same result.  (Either way we also need a way for multiple
concurrent sessions to share a snapshot.)
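
At the shell level the second option is roughly this (table and path names
hypothetical; note that without snapshot sharing the three files are not
guaranteed to be mutually consistent):

    pg_dump -t orders   mydb > /disk1/orders.sql &
    pg_dump -t lineitem mydb > /disk2/lineitem.sql &
    pg_dump -t history  mydb > /disk3/history.sql &
    wait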

When restoring, we can then just use multiple pg_restore sessions to
restore the individual data files.  Or we could write a multi-threaded
pg_restore to do the same thing, but why would I bother when I can
already do it with multiple sessions?  It gains us nothing.

Parallelising the index creation seems best done using concurrent psql.
We've agreed on some mods to psql to support multiple sessions.  If we do
that right, then we can make pg_restore generate a psql script with
multi-session commands scattered appropriately throughout.
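
Until then, the effect can be approximated from the shell (index
definitions hypothetical):

    psql -d target_db -c 'CREATE INDEX i_orders_date ON orders (odate);' &
    psql -d target_db -c 'CREATE INDEX i_lineitem_order ON lineitem (order_id);' &
    wait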

Parallel pg_restore is a lot of work for a narrow use case. Concurrent
psql provides a much wider set of use cases.

So fully parallelising dump/restore can be achieved by

* splitting dump into pieces (this patch)
* allowing sessions to share a common snapshot
* concurrent psql
* changes to pg_restore/psql/pg_dump to allow commands to be inserted
which will use concurrent psql features

If we do things this way then we have some useful tools that can be used
in a range of use cases, not just restore.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Training, Services and Support




Re: [PATCHES] pg_dump additional options for performance

2008-07-27 Thread Joshua D. Drake

Simon Riggs wrote:
> On Sat, 2008-07-26 at 11:03 -0700, Joshua D. Drake wrote:
> > 2. We have no concurrency, which means anyone with any database over 50G
> > has unacceptable restore times.
>
> Agreed.
>
> Sounds good.
>
> It doesn't help with the main element of dump time, though: one table at a
> time to one output file.  We need a way to dump multiple tables
> concurrently, ending up in multiple files/filesystems.


Agreed, but that is a problem I understand with a solution I don't.  I am
all eyes for a way to fix it.  One thought I had (and please, be gentle in
response) was some sort of async transaction capability.  I know that
libpq has the ability to send async queries.  Is it possible to do this:

send async(copy table to foo)
send async(copy table to bar)
send async(copy table to baz)

where all three copies are happening in the background?
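
(For comparison, the closest I can get today is one connection per COPY,
e.g. from the shell with hypothetical file names:

    psql -d mydb -c "\copy foo to '/disk1/foo.dat'" &
    psql -d mydb -c "\copy bar to '/disk2/bar.dat'" &
    psql -d mydb -c "\copy baz to '/disk3/baz.dat'" &
    wait

but that's three separate backends and three snapshots, not one session
doing async work.)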

Sincerely,

Joshua D. Drake

