I go more or less the other way. I have a script which follows the
child...parent...grandparent...etc. foreign keys in the source database and
dumps only the records belonging to the selected "project" (your
terminology; in my case it is "client"). I save the dumped data to an
archive file.

The script has hardcoded knowledge of only the bare minimum: the root
ancestor tables (plus a few special cases, mostly because this is a Django
ORM-managed schema). Otherwise it avoids knowing too much, so the schema
can evolve mostly without the script needing changes.
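
The ordering part can be sketched roughly like this (a toy example, not my
actual script: the table names and foreign-key map here are invented, and
the real thing introspects them from the schema rather than hardcoding them):

```python
# Toy sketch: given which tables each table references via foreign keys,
# compute an order in which to dump/restore so parents come before children.
from graphlib import TopologicalSorter

def table_order(fk_map):
    """fk_map maps table -> set of tables it references (its parents).
    Returns tables ordered so every parent precedes its children."""
    return list(TopologicalSorter(fk_map).static_order())

# Hypothetical three-level chain: project -> task -> task_log.
fk_map = {
    "project": set(),          # root ancestor table
    "task": {"project"},       # task.project_id -> project.id
    "task_log": {"task"},      # task_log.task_id -> task.id
}

order = table_order(fk_map)    # parents-first dump/restore order
```

Restoring in this order means every foreign key already has its target row
in place by the time the referencing rows are loaded.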

The script includes metadata about the dump in a "manifest". Among other
things, the manifest records the order in which the tables were dumped;
the restore operation uses this to order the table restores correctly.

A second script then walks the manifest and restores the archive, table by
table, from the individual per-table dumps.
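
In spirit, the manifest is little more than something like this (the JSON
layout and key names are invented for illustration; my real manifest
records more than just the table order):

```python
# Illustrative manifest handling: record the dump order at dump time,
# then replay it at restore time. The file layout here is an assumption.
import json
import os
import tempfile

def write_manifest(path, project, table_order):
    with open(path, "w") as f:
        json.dump({"project": project, "table_order": table_order}, f)

def restore_plan(path):
    """Yield (table, per-table dump filename) pairs in the recorded order."""
    with open(path) as f:
        manifest = json.load(f)
    for table in manifest["table_order"]:
        yield table, table + ".copy"

with tempfile.TemporaryDirectory() as d:
    mpath = os.path.join(d, "manifest.json")
    write_manifest(mpath, "a", ["project", "task", "task_log"])
    plan = list(restore_plan(mpath))
```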

On Sat, 9 Mar 2024, 14:56 hector vass, <hector.v...@gmail.com> wrote:

> On Fri, Mar 8, 2024 at 4:22 PM David Gauthier <dfgpostg...@gmail.com>
> wrote:
>
>> Here's the situation....
>>
>> - The DB contains data for several projects.
>> - The tables of the DB contain data for all projects (data is not
>> partitioned on project name or anything like that)
>> - The "project" identifier (table column) exists in a few "parent" tables
>> with many child... grandchild... tables under them connected with foreign
>> keys defined with "on delete cascade".  So if a record in one of the
>> parent tables is deleted, all of its underlying, dependent records get
>> deleted too.
>> - New projects come in, and old ones need to be removed and "archived" in
>> DBs of their own.  So there's a DB called "active_projects" and there's a
>> DB called "project_a_archive" (identical metadata).
>> - The idea is to copy the data for project "a" that's in
>> "active_projects" to the "project_a_archive" DB AND delete the project a
>> data out of "active_projects".
>> - Leave "project_a_archive" up and running if someone needs to attach to
>> that and get some old/archived data.
>>
>> The brute-force method I've been using is...
>> 1)  pg_dump "active_projects" to a (huge) file then populate
>> "project_a_archive" using that (I don't have the privs to create database,
>> IT creates an empty one for me, so this is how I do it).
>> 2) go into the "project_a_archive" DB and run... "delete from par_tbl_1
>> where project <> 'a' ", "delete from par_tbl_2 where project <> 'a' ",
>> etc... leaving only project "a" data in the DB.
>> 3) go into the "active_projects" DB and "delete from par_tbl_1 where
>> project = 'a' ", etc... removing project "a" from the "active_projects DB.
>>
>> Ya, not very elegant, it takes a long time and it takes a lot of
>> resources.  So I'm looking for ideas on how to do this better.
>>
>> Related question...
>> The "delete from par_tbl_a where project <> 'a' " is taking forever.  I
>> fear it's because it's trying to journal everything in case I want to
>> roll back.  But this is just in the archive DB and I don't mind taking the
>> risk if I can speed this up outside of a transaction.  How can I run a
>> delete command like this without the rollback recovery overhead ?
>>
>
>
> >(I don't have the privs to create database, IT creates an empty one for
> me, so this is how I do it).
>
> That's a shame.  You can do something similar with tablespaces:
>   Template your existing schema to create a new schema for the project
> (pg_dump -s)
>   Create a tablespace for this new project and schema
>
>  You can then move the physical tablespace to cheaper disk and use
> symbolic links, or archive and/or back it up at the schema level with
> pg_dump -n
>
> ...as long as you don't put anything in the public schema, all you are
> really sharing is roles; otherwise it is a bit like a separate database
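
For what it's worth, the schema-level dumps suggested above would look
roughly like this (command lines only, built but not executed here; the
database and schema names are invented, while -s and -n are real pg_dump
flags):

```python
# Build (but do not run) the pg_dump invocations for the schema-per-project
# approach: -s dumps DDL only, -n restricts the dump to one schema.
def ddl_template_cmd(dbname):
    # schema-only dump, to template the table structure for a new project
    return ["pg_dump", "-s", dbname]

def schema_archive_cmd(dbname, schema):
    # dump a single project's schema, e.g. before moving it to cheaper disk
    return ["pg_dump", "-n", schema, dbname]

ddl_cmd = ddl_template_cmd("active_projects")
archive_cmd = schema_archive_cmd("active_projects", "project_a")
```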
