All,

This is perhaps not the ideal time for a discussion given where we are in
the release cycle, so if I thought it would turn into a long one I'd
probably not be posting this now.
This is something I thought of while doing a restore of a 40ish GB database
which has a few hundred smallish tables of various sizes, up to about 1.5
million records, plus a handful of larger tables containing 20-70 million
records.

During the restore (which was running 4 separate jobs), I was polling
SELECT query FROM pg_stat_activity to check on the progress of the restore.
I noticed that there were now fewer than 4 jobs running and pg_restore was
busy doing COPY into some of the 20-70 million record tables.
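
For reference, what I was running to watch progress was roughly the
following (just a sketch from memory; the pid column assumes a reasonably
recent pg_stat_activity):

        -- Watch which COPY commands the restore workers are running.
        SELECT pid, query
        FROM pg_stat_activity
        WHERE query LIKE 'COPY %';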

If pg_dump were still to follow the dependencies of objects, would there be
any reason why it shouldn't back up the larger tables first? This should
then allow pg_restore to balance the smaller tables across the separate
jobs at the end of the restore, instead of having CPUs sitting idle while,
say, 1 job is still busy on a big table.
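
To illustrate the ordering I have in mind, here's a sketch of the idea in
SQL; pg_dump would of course derive the equivalent information internally
rather than run a query like this:

        -- User tables ordered largest-first by on-disk block count.
        SELECT c.oid::regclass AS table_name, c.relpages
        FROM pg_class c
        JOIN pg_namespace n ON n.oid = c.relnamespace
        WHERE c.relkind = 'r'
          AND n.nspname NOT IN ('pg_catalog', 'information_schema')
        ORDER BY c.relpages DESC;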

Of course this would not improve things for all workloads, but I hardly
think that a database with a large number of smallish tables and a small
number of large tables is unusual.

If there were consensus that this might be a good idea, I'd be willing to
craft up a patch to test whether it's worth it.

Some of the things I thought about but did not have an answer for:
        1. Would it be enough just to check the number of blocks in each
relation, or would it be better to look at the statistics to estimate the
size of the table once it's restored, minus the dead tuples? (See the
sketch after this list.)
        2. Would it be a good idea to add an extra pg_dump option for this
or just make it the default for all dumps that contain table data?
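
To make point 1 a little more concrete, here is a rough sketch showing the
two measures side by side; the "estimated live blocks" formula is just one
possibility I made up for illustration, not something I'm wedded to:

        -- Raw block count versus a crude estimate that discounts dead
        -- tuples using the stats collector's counters.
        SELECT c.oid::regclass AS table_name,
               c.relpages      AS blocks_on_disk,
               s.n_dead_tup    AS dead_tuples,
               round(c.relpages * (1 - s.n_dead_tup::float8 /
                     greatest(c.reltuples + s.n_dead_tup, 1))) AS est_live_blocks
        FROM pg_class c
        JOIN pg_stat_all_tables s ON s.relid = c.oid
        WHERE c.relkind = 'r';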


Any thoughts on this are welcome.

Regards

David Rowley


