Re: creating a subset DB efficiently ?

2024-03-09 Thread Peter J. Holzer
On 2024-03-08 11:22:17 -0500, David Gauthier wrote:
> Related question...
> The "delete from par_tbl_a where project <> 'a' " is taking forever.

Have you checked whether all the FK columns have an index? Otherwise
checking for references means a full table scan which may take a long
time.
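A sketch of a catalog query for spotting that situation (note: it only checks that the FK columns appear somewhere in an index on the referencing table, not that they form a usable leading prefix, so treat hits as candidates to inspect rather than proof):

```sql
-- Foreign keys whose referencing columns are not covered by any index.
SELECT con.conrelid::regclass AS referencing_table,
       con.conname            AS fk_name
FROM   pg_constraint con
WHERE  con.contype = 'f'
  AND  NOT EXISTS (
         SELECT 1
         FROM   pg_index i
         WHERE  i.indrelid = con.conrelid
           -- indkey is an int2vector; compare it as a smallint array
           AND  con.conkey <@ string_to_array(i.indkey::text, ' ')::smallint[]
       );
```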

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: creating a subset DB efficiently ?

2024-03-09 Thread Shaheed Haque
I go more or less the other way. I have a script which follows the
child...parent...grandparent...etc foreign keys in the source database and
dumps only the records belonging to the selected "project" (your
terminology, in my case it is "client"). I save the dumped data to an
archive file.

The script has hardcoded knowledge of only the minimal set of root ancestor
tables (plus certain special cases, mostly relating to the fact that I'm
talking about a Django ORM-managed schema) and otherwise tries to avoid
knowing too much, so that the schema can evolve mostly without bothering
the script.

The script includes metadata about the dump in a "manifest". The manifest
records, amongst other things, the order in which the tables were dumped.
The restore operation uses this to correctly order the table restores.

I can then restore the archive table by table from the individual dumps,
using a script which walks the manifest.
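A minimal sketch of the restore side of this approach, assuming the dump step wrote one `<table>.csv` per table and a `manifest.txt` listing tables in dump order (parents before children). The file names and the `DRYRUN` switch are hypothetical, not from the original script:

```shell
#!/bin/sh
# Restore tables in manifest order so FK targets are loaded first.
# With DRYRUN=1 the psql commands are printed instead of executed.
restore_from_manifest() {
    manifest=$1
    db=$2
    while IFS= read -r tbl; do
        cmd="psql -d $db -c \"\\copy $tbl FROM '$tbl.csv' CSV\""
        if [ "${DRYRUN:-0}" = 1 ]; then
            printf '%s\n' "$cmd"
        else
            eval "$cmd"
        fi
    done < "$manifest"
}
```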

On Sat, 9 Mar 2024, 14:56 hector vass,  wrote:

> On Fri, Mar 8, 2024 at 4:22 PM David Gauthier 
> wrote:
>
>> Here's the situation
>>
>> - The DB contains data for several projects.
>> - The tables of the DB contain data for all projects (data is not
>> partitioned on project name or anything like that)
>> - The "project" identifier (table column) exists in a few "parent" tables
>> with many child... grandchild,... tables under them connected with foreign
>> keys defined with "on delete cascade".  So if a record in one of the parent
>> tables is deleted, all of its underlying, dependent records get deleted
>> too.
>> - New projects come in, and old ones need to be removed and "archived" in
>> DBs of their own.  So there's a DB called "active_projects" and there's a
>> DB called "project_a_archive" (identical metadata).
>> - The idea is to copy the data for project "a" that's in
>> "active_projects" to the "project_a_archive" DB AND delete the project "a"
>> data out of "active_projects".
>> - Leave "project_a_archive" up and running if someone needs to attach to
>> that and get some old/archived data.
>>
>> The brute-force method I've been using is...
>> 1)  pg_dump "active_projects" to a (huge) file then populate
>> "project_a_archive" using that (I don't have the privs to create database,
>> IT creates an empty one for me, so this is how I do it).
>> 2) go into the "project_a_archive" DB and run... "delete from par_tbl_1
>> where project <> 'a' ", "delete from par_tbl_2 where project <> 'a' ",
>> etc... leaving only project "a" data in the DB.
>> 3) go into the "active_projects" DB and "delete from par_tbl_1 where
>> project = 'a' ", etc... removing project "a" from the "active_projects" DB.
>>
>> Ya, not very elegant, it takes a long time and it takes a lot of
>> resources.  So I'm looking for ideas on how to do this better.
>>
>> Related question...
>> The "delete from par_tbl_a where project <> 'a' " is taking forever.  I
>> fear it's because it's trying to journal everything in case I want to
>> rollback.  But this is just in the archive DB and I don't mind taking the
>> risk if I can speed this up outside of a transaction.  How can I run a
>> delete command like this without the rollback recovery overhead ?
>>
>
>
> >(I don't have the privs to create database, IT creates an empty one for
> me, so this is how I do it).
>
> That's a shame.  You can do something similar with tablespaces
>   Template your existing schema to create a new schema for the project
> (pg_dump -s)
>   Create tablespace for this new project and schema
>
>  You can then move the physical tablespace to cheaper disk and use
> symbolic links or... archive and/or back it up at the schema level with
> pg_dump -n
>
> ...as long as you don't put anything in the public schema, all you are
> really sharing is roles; otherwise it's a bit like a separate database.


Re: creating a subset DB efficiently ?

2024-03-09 Thread Adrian Klaver

On 3/8/24 08:22, David Gauthier wrote:

Here's the situation

- The DB contains data for several projects.
- The tables of the DB contain data for all projects (data is not 
partitioned on project name or anything like that)
- The "project" identifier (table column) exists in a few "parent"
tables with many child... grandchild,... tables under them connected
with foreign keys defined with "on delete cascade".  So if a record in
one of the parent tables is deleted, all of its underlying,
dependent records get deleted too.


How many "... child... grandchild,... tables" ?

Do these tables constitute all the tables in the database?



Related question...
The "delete from par_tbl_a where project <> 'a' " is taking forever.  I 
fear it's because it's trying to journal everything in case I want to 
rollback.  But this is just in the archive DB and I don't mind taking 
the risk if I can speed this up outside of a transaction.  How can I run 
a delete command like this without the rollback recovery overhead ?


I am assuming that at the point you do "delete from par_tbl_a where
project <> 'a' ", project a is no longer receiving data and its records
are static. Assuming further that there is a PK you can order by, the
way to go would be to delete in batches determined by the PK.
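A sketch of that batched-delete pattern, assuming a hypothetical integer primary key "id" on par_tbl_a (repeat until zero rows are affected):

```sql
-- Delete project-"a"-foreign rows 10000 at a time, keyed on the PK,
-- so each transaction stays short.
DELETE FROM par_tbl_a
WHERE id IN (
    SELECT id
    FROM   par_tbl_a
    WHERE  project <> 'a'
    ORDER  BY id
    LIMIT  10000
);
```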



--
Adrian Klaver
adrian.kla...@aklaver.com

Re: creating a subset DB efficiently ?

2024-03-09 Thread hector vass
On Fri, Mar 8, 2024 at 4:22 PM David Gauthier  wrote:

> Here's the situation
>
> - The DB contains data for several projects.
> - The tables of the DB contain data for all projects (data is not
> partitioned on project name or anything like that)
> - The "project" identifier (table column) exists in a few "parent" tables
> with many child... grandchild,... tables under them connected with foreign
> keys defined with "on delete cascade".  So if a record in one of the parent
> tables is deleted, all of its underlying, dependent records get deleted
> too.
> - New projects come in, and old ones need to be removed and "archived" in
> DBs of their own.  So there's a DB called "active_projects" and there's a
> DB called "project_a_archive" (identical metadata).
> - The idea is to copy the data for project "a" that's in "active_projects"
> to the "project_a_archive" DB AND delete the project "a" data out of
> "active_projects".
> - Leave "project_a_archive" up and running if someone needs to attach to
> that and get some old/archived data.
>
> The brute-force method I've been using is...
> 1)  pg_dump "active_projects" to a (huge) file then populate
> "project_a_archive" using that (I don't have the privs to create database,
> IT creates an empty one for me, so this is how I do it).
> 2) go into the "project_a_archive" DB and run... "delete from par_tbl_1
> where project <> 'a' ", "delete from par_tbl_2 where project <> 'a' ",
> etc... leaving only project "a" data in the DB.
> 3) go into the "active_projects" DB and "delete from par_tbl_1 where
> project = 'a' ", etc... removing project "a" from the "active_projects" DB.
>
> Ya, not very elegant, it takes a long time and it takes a lot of
> resources.  So I'm looking for ideas on how to do this better.
>
> Related question...
> The "delete from par_tbl_a where project <> 'a' " is taking forever.  I
> fear it's because it's trying to journal everything in case I want to
> rollback.  But this is just in the archive DB and I don't mind taking the
> risk if I can speed this up outside of a transaction.  How can I run a
> delete command like this without the rollback recovery overhead ?
>


>(I don't have the privs to create database, IT creates an empty one for
me, so this is how I do it).

That's a shame.  You can do something similar with tablespaces:
  Template your existing schema to create a new schema for the project
(pg_dump -s)
  Create a tablespace for this new project and schema

You can then move the physical tablespace to cheaper disk and use symbolic
links, or archive and/or back it up at the schema level with pg_dump -n.

...as long as you don't put anything in the public schema, all you are
really sharing is roles; otherwise it's a bit like a separate database.
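The schema-level steps above might look like this (a sketch with placeholder database/schema names; it needs a live cluster and appropriate privileges, so adjust to taste):

```shell
# Template the existing schema definitions only (no data).
pg_dump -s -d active_projects -f schema_template.sql

# Later: archive one project's schema with pg_dump -n and restore it
# into the archive database.  "project_a" is a placeholder schema name.
pg_dump -n project_a -d active_projects -Fc -f project_a.dump
pg_restore -d project_a_archive project_a.dump
```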


Re: creating a subset DB efficiently ?

2024-03-08 Thread Ron Johnson
On Fri, Mar 8, 2024 at 11:22 AM David Gauthier 
wrote:

> Here's the situation
>
> - The DB contains data for several projects.
> - The tables of the DB contain data for all projects (data is not
> partitioned on project name or anything like that)
> - The "project" identifier (table column) exists in a few "parent" tables
> with many child... grandchild,... tables under them connected with foreign
> keys defined with "on delete cascade".  So if a record in one of the parent
> tables is deleted, all of its underlying, dependent records get deleted
> too.
> - New projects come in, and old ones need to be removed and "archived" in
> DBs of their own.  So there's a DB called "active_projects" and there's a
> DB called "project_a_archive" (identical metadata).
> - The idea is to copy the data for project "a" that's in "active_projects"
> to the "project_a_archive" DB AND delete the project "a" data out of
> "active_projects".
> - Leave "project_a_archive" up and running if someone needs to attach to
> that and get some old/archived data.
>
> The brute-force method I've been using is...
> 1)  pg_dump "active_projects" to a (huge) file then populate
> "project_a_archive" using that (I don't have the privs to create database,
> IT creates an empty one for me, so this is how I do it).
> 2) go into the "project_a_archive" DB and run... "delete from par_tbl_1
> where project <> 'a' ", "delete from par_tbl_2 where project <> 'a' ",
> etc... leaving only project "a" data in the DB.
> 3) go into the "active_projects" DB and "delete from par_tbl_1 where
> project = 'a' ", etc... removing project "a" from the "active_projects" DB.
>
> Ya, not very elegant, it takes a long time and it takes a lot of
> resources.  So I'm looking for ideas on how to do this better.
>
>
I wrote this function to list all of the FK "children" that reference a
"parent" table.  It might help you determine all of the children,
grandchildren, etc.

CREATE OR REPLACE FUNCTION dba.get_fk_referenced_by_child(_p_table_name TEXT)
RETURNS TABLE (downstream_table TEXT, downstream_column NAME,
downstream_index NAME)
LANGUAGE plpgsql
AS
$$
DECLARE
    _v_name TEXT;
    _c_downstream CURSOR FOR
        SELECT cla2.relnamespace::regnamespace::text || '.' ||
                   co.conrelid::regclass::text AS downstream_table
             , co.conrelid AS downstream_oid
             , att.attname AS downstream_column
        FROM pg_constraint co
        INNER JOIN pg_class cla1 ON co.confrelid = cla1.oid
        INNER JOIN pg_class cla2 ON co.conrelid = cla2.oid
        INNER JOIN pg_attribute att ON co.conrelid = att.attrelid
        WHERE co.contype = 'f'
          AND cla1.relnamespace::regnamespace::text || '.' || cla1.relname
              = _v_name
          AND att.attnum = co.conkey[1]
        ORDER BY 1;
BEGIN
    _v_name := TRIM(BOTH FROM _p_table_name);

    -- Bail out early if the named table does not exist.
    PERFORM relname
    FROM pg_class
    WHERE relnamespace::regnamespace::text || '.' || relname = _v_name
      AND relkind = 'r';
    IF NOT FOUND THEN
        downstream_table  := _v_name;
        downstream_column := 'does not exist';
        downstream_index  := NULL;
        RETURN NEXT;
        RETURN;
    END IF;

    FOR i IN _c_downstream
    LOOP
        downstream_table  := i.downstream_table;
        downstream_column := i.downstream_column;

        -- Look for an index on the referencing (child-side) column.
        SELECT indexrelid::regclass
        INTO downstream_index
        FROM pg_index
        INNER JOIN pg_attribute ON indexrelid = attrelid
        WHERE indrelid = i.downstream_oid
          AND attname = i.downstream_column;

        IF downstream_index IS NULL THEN
            downstream_index := 'ALERT!!! MISSING INDEX';
        END IF;
        RETURN NEXT;
    END LOOP;
END;
$$;
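A hypothetical invocation (the schema-qualified table name is a placeholder for one of your parent tables):

```sql
-- Lists every table with an FK pointing at public.par_tbl_1, plus the
-- index on the referencing column, or a warning if none exists.
SELECT * FROM dba.get_fk_referenced_by_child('public.par_tbl_1');
```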



> Related question...
> The "delete from par_tbl_a where project <> 'a' " is taking forever.  I
> fear it's because it's trying to journal everything in case I want to
> rollback.  But this is just in the archive DB and I don't mind taking the
> risk if I can speed this up outside of a transaction.  How can I run a
> delete command like this without the rollback recovery overhead ?
>