Webb Sprague wrote:
I haven't tested but this is what I would do (uses arrays, which are
handy when you need them), with the names changed to protect the
innocent:

begin;
-- create a table with some duplicates in one of the columns (y is ck);
wsprague=# select x, x%4 as y into temp fbar from generate_series(1,10) as x(x);
SELECT

wsprague=# select * from fbar;
 x  | y
----+---
  1 | 1
  2 | 2
  3 | 3
  4 | 0
  5 | 1
  6 | 2
  7 | 3
  8 | 0
  9 | 1
 10 | 2
(10 rows)

-- create a table with the pk, an array of the duplicate keys,
-- and the length of that array
select y, x_list, array_upper(x_list, 1) as x_len into baz from
(select y, array_accum(x) as x_list from fbar group by y ) a ;

wsprague=# select * from baz;
 y |  x_list  | x_len
---+----------+-------
 3 | {3,7}    |     2
 2 | {2,6,10} |     3
 1 | {1,5,9}  |     3
 0 | {4,8}    |     2
(4 rows)
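(A note for anyone following along: array_accum is not a built-in aggregate. The PostgreSQL manual's user-defined aggregate example defines it like so; on 8.4 and later the built-in array_agg does the same job.)

```sql
-- The array_accum aggregate from the PostgreSQL docs: appends each input
-- value to an array, starting from an empty array.
CREATE AGGREGATE array_accum (anyelement)
(
    sfunc = array_append,
    stype = anyarray,
    initcond = '{}'
);
```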

-- delete all rows that don't have ck in the first element of the pk list
wsprague=# delete from fbar where not exists (select 1 from baz where fbar.x=baz.x_list[1]);
DELETE 6
wsprague=# select * from fbar;
 x | y
---+---
 1 | 1
 2 | 2
 3 | 3
 4 | 0
(4 rows)

commit;

I hope that makes sense.  It should be easy to make into a function
(like clean_table(table_name text, pk_name text, dup_key_name text)).
I don't know how well it will work with indexes.  You could probably
even write a function to do the entire import inside Postgres,
starting with a COPY, moving on to merging tables, and ending with
some consistency checks, and thus benefit from transaction isolation
of the whole process....
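An untested sketch of what that clean_table() function might look like: since the table and column names arrive as text, everything has to go through EXECUTE with quote_ident. It keeps the row whose pk comes first in each duplicate group and deletes the rest (the function name and signature are just the ones suggested above, not anything existing):

```sql
-- Untested sketch: delete every row whose pk is not the first pk
-- accumulated for its duplicate-key group. Dynamic names, so the
-- statement is built as a string and run with EXECUTE.
CREATE OR REPLACE FUNCTION clean_table(table_name text, pk_name text, dup_key_name text)
RETURNS void AS $$
BEGIN
    EXECUTE 'DELETE FROM ' || quote_ident(table_name)
         || ' WHERE ' || quote_ident(pk_name) || ' NOT IN ('
         || '   SELECT (array_accum(' || quote_ident(pk_name) || '))[1]'
         || '   FROM ' || quote_ident(table_name)
         || '   GROUP BY ' || quote_ident(dup_key_name) || ')';
END;
$$ LANGUAGE plpgsql;
```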



Hey, neat idea. I changed things a little:

test=# SELECT * FROM table1;
 id | ck | content
----+----+---------
  1 |  1 | string1
  2 |  2 | string2
  3 |  2 | string3
  4 |  4 | string1
  5 |  5 | string1
  6 |  1 | string3
  7 |  2 | string1
(7 rows)

test=# SELECT key_list, array_upper(key_list, 1) AS key_num, content
test-# INTO table2 FROM (
test-# SELECT array_accum(ck) AS key_list, content FROM table1 GROUP BY content
test-# ) a;
SELECT

test=# SELECT * FROM table2;
 key_list  | key_num | content
-----------+---------+---------
 {2,1}     |       2 | string3
 {1,4,5,2} |       4 | string1
 {2}       |       1 | string2
(3 rows)


I had to modify your DELETE suggestion a bit because table1.id (your x) is non-unique (although it must be within a given key_list). It'll be a performance hog, though, because it's comparing each string.


test=# DELETE FROM table1 WHERE NOT EXISTS (
test-# SELECT 1 FROM table2
test-# WHERE table1.content = table2.content
test-# AND table1.ck = table2.key_list[1]
test-# );
DELETE 4

test=# SELECT * FROM table1;
 id | ck | content
----+----+---------
  1 |  1 | string1
  2 |  2 | string2
  3 |  2 | string3
(3 rows)

From here I can drop the table1.ck column. But in order to place the table2.key_list ints into a join table, I need the PK from table1. I have the content column with which to fetch that from table1, but that hardly seems efficient. So I modified your idea for gathering the dupe CKs a little:

test=# SELECT key_list, array_upper(key_list, 1) AS key_num, content
test-# INTO table2 FROM (
test-# SELECT array_cat(ARRAY[array_accum(id)], ARRAY[array_accum(ck)]) AS key_list, content
test-# FROM table1 GROUP BY content) a;
SELECT

test=# select * from table2;
       key_list        | key_num | content
-----------------------+---------+---------
 {{3,6},{2,1}}         |       2 | string3
 {{1,4,5,7},{1,4,5,2}} |       2 | string1
 {{2},{2}}             |       2 | string2
(3 rows)

Huzzah! IOW, I accumulate both the PKs and the ck from table1 for each string. Then I can ditch all but the first PK, which should make transferring these values into the join table a snap.

# lightbulb comes on ...

Actually, what I'll do is remove the extraneous PKs and use them to DELETE FROM table1. Much more efficient, that.
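Something like this might do it (untested, just a sketch of the lightbulb idea): the extraneous PKs are everything after the first element of the id row of key_list, so an array slice plus = ANY avoids pulling them out row by row. Note array_upper(key_list, 2) here, since key_num above is only the number of dimensions:

```sql
-- Untested sketch: slice off everything after the first id in each
-- group's id row (key_list[1]) and delete those rows directly.
-- The slice keeps its array shape, so membership is tested with = ANY.
DELETE FROM table1
USING table2
WHERE table1.id = ANY (table2.key_list[1:1][2:array_upper(table2.key_list, 2)]);
```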

Thanks for your ideas. And if anyone else has any suggestions, please feel free to chime in. When I get the rest of the data I'll post back here on how it went, in case Google points someone here in the future.

b

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
