Well, about the best approach with large datasets - there is rarely an
"always true" best principle... There are always a few ways, and only a
test confirms which one is best - it depends on many things, like hardware,
OS, etc... Sometimes it even depends on the dataset for the update...

"
CREATE TEMP TABLE tmp AS SELECT * FROM foo;

TRUNCATE foo;

INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);
"

The above doesn't make sense to me... I would do:

CREATE TEMP TABLE tmp AS SELECT * FROM foo WHERE 1=2;

COPY tmp FROM 'pathtofile';

UPDATE foo
SET col1 = tmp.col1,
    .
    .
    col15 = tmp.col15
FROM tmp
WHERE foo.id = tmp.id;

That is in case I know I need just an update... If my dataset has a mix of
updates and potentially new rows, then instead of the UPDATE command I
would do:

DELETE FROM foo WHERE EXISTS (SELECT 1 FROM tmp WHERE tmp.id = foo.id);

INSERT INTO foo SELECT * FROM tmp;

On Saturday, April 27, 2013, Yang Zhang wrote:

> On Sat, Apr 27, 2013 at 3:06 AM, Misa Simic <misa.si...@gmail.com> wrote:
> > I dont know - u can test :)
>
> I probably will, but I do have a huge stack of such experiments to run
> by now, and it's always tricky / takes care to get benchmarks right,
> avoid disk caches, etc.  Certainly I think it would be helpful (or at
> least hopefully not harmful) to ask here to see if anyone might just
> know.  That's what brought me to this list. :)
>
> > In the whole solution it is just one command different - so easy to
> > test and compare...
> >
> > To me it doesn't sound faster... It sounds like more operations need
> > to be done...
> >
> > And it produces more problems... i.e. what about table foo? What if
> > another table references foo, etc...
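For what it's worth, the delete-then-insert merge above can be sanity-checked
with a tiny runnable sketch - here using Python's built-in sqlite3 purely as a
stand-in (the thread is about PostgreSQL; sqlite has no COPY, so executemany
loads the staging table instead, and the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "foo" is the live table from the thread; the rows are invented samples.
cur.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, col1 TEXT)")
cur.executemany("INSERT INTO foo VALUES (?, ?)",
                [(1, "old-1"), (2, "old-2"), (3, "old-3")])

# "tmp" plays the role of the staging table; on PostgreSQL it would be
# populated with COPY from the transferred file.
cur.execute("CREATE TEMP TABLE tmp (id INTEGER PRIMARY KEY, col1 TEXT)")
cur.executemany("INSERT INTO tmp VALUES (?, ?)",
                [(2, "new-2"), (4, "new-4")])  # one update, one new row

# The merge: drop every live row that has a staged replacement,
# then bulk-insert the whole staging table.
cur.execute("DELETE FROM foo WHERE EXISTS"
            " (SELECT 1 FROM tmp WHERE tmp.id = foo.id)")
cur.execute("INSERT INTO foo SELECT * FROM tmp")
conn.commit()

print(sorted(cur.execute("SELECT id, col1 FROM foo")))
# -> [(1, 'old-1'), (2, 'new-2'), (3, 'old-3'), (4, 'new-4')]
```

Row 2 is replaced, row 4 is appended, rows 1 and 3 are untouched - i.e. an
upsert done as two set-based statements instead of per-row UPDATEs.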
>
> Yep, I guess more specifically I was just thinking of dumping to a temp
> table:
>
> CREATE TEMP TABLE tmp AS
> SELECT * FROM foo;
>
> TRUNCATE foo;
>
> INSERT INTO foo
> SELECT * FROM bar
> UNION
> SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);
>
> The question I have remaining is whether the bulk UPDATE will be able
> to update many rows efficiently (smartly ordering them to do largely
> sequential scans) - if so, I imagine it would be faster than the above.
>
> > On Saturday, April 27, 2013, Yang Zhang wrote:
> >>
> >> On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic <misa.si...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > If the dataset for the update is large...
> >> >
> >> > Maybe the best approach would be:
> >> >
> >> > From the client machine, instead of sending update statements with
> >> > the data - export the data to a file ready for the COPY command.
> >> > Transfer the file to the server where pg is running.
> >> > Make a pgsql function which:
> >> >
> >> > Creates a temp table
> >> > Copies to temp from the file
> >> >
> >> > Updates the original table with the values in temp:
> >> >
> >> > UPDATE foo
> >> > SET col1 = bar.col1
> >> > FROM bar
> >> > WHERE foo.id = bar.id
> >> >
> >> > You don't need to do delete/insert - if you have just update commands...
> >> >
> >> > From the client, when the file is transferred - call your import
> >> > function on the server.
> >> >
> >> > Optionally you can run VACUUM ANALYZE after the bulk operation...
> >>
> >> But wouldn't a bulk UPDATE touch many existing pages (say, 20%
> >> scattered around) to mark rows as dead (per MVCC)?  I guess it comes
> >> down to: will PG be smart enough to mark dead rows in largely
> >> sequential scans (rather than, say, jumping around in whatever order
> >> rows from foo are yielded by the above join)?
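The snapshot-truncate-rebuild variant quoted above can be sketched the same
way (again sqlite3 as a stand-in for PostgreSQL, with DELETE in place of
TRUNCATE, which sqlite lacks; the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Live table "foo" and incoming table "bar", with invented sample rows.
cur.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, col1 TEXT)")
cur.executemany("INSERT INTO foo VALUES (?, ?)",
                [(1, "old-1"), (2, "old-2"), (3, "old-3")])
cur.execute("CREATE TABLE bar (id INTEGER PRIMARY KEY, col1 TEXT)")
cur.executemany("INSERT INTO bar VALUES (?, ?)",
                [(2, "new-2"), (4, "new-4")])

# Snapshot foo, empty it, then rebuild it from bar plus the rows of
# the snapshot that bar does not replace.
cur.execute("CREATE TEMP TABLE tmp AS SELECT * FROM foo")
cur.execute("DELETE FROM foo")  # stands in for TRUNCATE foo
cur.execute("INSERT INTO foo "
            "SELECT * FROM bar "
            "UNION "
            "SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar)")
conn.commit()

print(sorted(cur.execute("SELECT id, col1 FROM foo")))
# -> [(1, 'old-1'), (2, 'new-2'), (3, 'old-3'), (4, 'new-4')]
```

The end state is identical to the delete-then-insert merge; the difference
being debated in the thread is purely how much page-level I/O each plan costs.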
> >>
> >> In other words, when considering the alternative of:
> >>
> >> CREATE TABLE newfoo AS
> >> SELECT * FROM bar
> >> UNION
> >> SELECT * FROM foo
> >> WHERE id NOT IN (SELECT id FROM bar);
> >>
> >> Wouldn't this alternative be faster?
> >>
> >> > Kind regards,
> >> >
> >> > Misa
> >> >
> >> > On Saturday, April 27, 2013, Yang Zhang wrote:
> >> >>
> >> >> On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> >> >> > Yang Zhang <yanghates...@gmail.com> writes:
> >> >> >> It currently takes up to 24h for us to run a large set of UPDATE
> >> >> >> statements on a database, which are of the form:
> >> >> >>
> >> >> >>     UPDATE table SET field1 = constant1, field2 = constant2, ...
> >> >> >>     WHERE id = constid
> >> >> >>
> >> >> >> (We're just overwriting fields of objects identified by ID.)
> >> >> >
> >> >> > Forgive the obvious question, but you do have an index on "id",
> >> >> > right?  Have you checked it's being used (ie EXPLAIN ANALYZE on
> >> >> > one of these)?
> >> >>
> >> >> Totally valid question.  That is the primary key, with its own index.
> >> >> Yes, we verified that EXPLAIN says it just uses a simple index scan.
> >> >> Each individual query runs reasonably quickly (we can run several
> >> >> dozen such statements per second).
> >> >>
> >> >> >> The tables have handfuls of indices each and no foreign key
> >> >> >> constraints.
> >> >> >
> >> >> > How much is a "handful"?
> >> >>
> >> >> The table with the largest volume of updates (our bottleneck) has
> >> >> four indexes:
> >> >>
> >> >>     "account_pkey" PRIMARY KEY, btree (id)
> >> >>     "account_createddate" btree (createddate)
> >> >>     "account_id_prefix" btree (id text_pattern_ops)
> >> >>     "account_recordtypeid" btree (recordtypeid)
> >> >>
> >> >> >> It takes 2h to import a `pg_dump` of the entire DB.  This seems
> >> >> >> like a
> >
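To make the "newfoo" alternative concrete - and to show why the objection
about other tables referencing foo matters - here is a hedged sketch of the
full build-and-swap, again on sqlite3 with invented sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, col1 TEXT)")
cur.executemany("INSERT INTO foo VALUES (?, ?)",
                [(1, "old-1"), (2, "old-2"), (3, "old-3")])
cur.execute("CREATE TABLE bar (id INTEGER PRIMARY KEY, col1 TEXT)")
cur.executemany("INSERT INTO bar VALUES (?, ?)",
                [(2, "new-2"), (4, "new-4")])

# Build the merged table in one pass, then swap it in for foo.
cur.execute("CREATE TABLE newfoo AS "
            "SELECT * FROM bar "
            "UNION "
            "SELECT * FROM foo WHERE id NOT IN (SELECT id FROM bar)")
cur.execute("DROP TABLE foo")  # any FKs or views on foo break at this point
cur.execute("ALTER TABLE newfoo RENAME TO foo")
conn.commit()

print(sorted(cur.execute("SELECT id, col1 FROM foo")))
# -> [(1, 'old-1'), (2, 'new-2'), (3, 'old-3'), (4, 'new-4')]
```

Two caveats worth flagging: CREATE TABLE ... AS carries over no primary key
or indexes, so the swapped-in foo needs them all rebuilt (the four indexes on
the account table in this thread, for instance); and if bar.id could ever
contain a NULL, NOT IN matches nothing at all - NOT EXISTS with a correlated
subquery sidesteps that.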