Re: [GENERAL] Optimizing bulk update performance

2013-04-27 Thread Yang Zhang
On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yang Zhang yanghates...@gmail.com writes:
 It currently takes up to 24h for us to run a large set of UPDATE
 statements on a database, which are of the form:

 UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
 id = constid

 (We're just overwriting fields of objects identified by ID.)

 Forgive the obvious question, but you do have an index on id, right?
 Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?

Totally valid question.  That is the primary key with its own index.
Yes, we verified that EXPLAIN shows it just uses a simple index scan.
Each individual query runs reasonably quickly (we can run several
dozen such statements per second).


 The tables have handfuls of indices each and no foreign key constraints.

 How much is a handful?

The table with the largest volume of updates (our bottleneck) has four indexes:

account_pkey PRIMARY KEY, btree (id)
account_createddate btree (createddate)
account_id_prefix btree (id text_pattern_ops)
account_recordtypeid btree (recordtypeid)


 It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
 baseline we should reasonably target.

 Well, maybe.  You didn't say what percentage of the DB you're updating.

It can be 10-50% of rows changed - a large portion.


 But the thing that comes to mind here is that you're probably incurring
 a network round trip for each row, and maybe a query-planning round as
 well, so you really can't expect that this is going to be anywhere near
 as efficient as a bulk load operation.  You could presumably get rid of
 the planner overhead by using a prepared statement.  Cutting the network
 overhead is going to require a bit more ingenuity --- could you move
 some logic into a stored procedure, perhaps, so that one command from
 the client is sufficient to update multiple rows?

You're right, we're only sequentially issuing (unprepared) UPDATEs.

If we ship many UPDATE statements per call to our DB API's execution
function (we're using Python's psycopg2 if that matters, but I think
that just binds libpq), would that avoid the network round trip per
statement?

If not, what if we use anonymous procedures (DO) to run multiple
UPDATE statements?
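
Concretely, I'm imagining something like this (made-up table/column names
and values), with the whole block going to the server as one command:

DO $$
BEGIN
    -- each of these would otherwise be its own client/server round trip
    UPDATE account SET field1 = 'v1', field2 = 'w1' WHERE id = 'id1';
    UPDATE account SET field1 = 'v2', field2 = 'w2' WHERE id = 'id2';
    -- ... and so on for the rest of the batch, generated client-side ...
END;
$$;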

Finally, we could use the technique highlighted in my third bullet: use
COPY (or at least multiple-value INSERT), then merge the new data with
the old.  Would that be the most direct route to maximum performance?

In any case, I assume deleting and rebuilding indexes is important
here, yes?  But what about raising checkpoint_segments - does this
actually help sustained throughput?


 regards, tom lane


--
Yang Zhang
http://yz.mit.edu/




Re: [GENERAL] Optimizing bulk update performance

2013-04-27 Thread Yang Zhang
On Sat, Apr 27, 2013 at 12:24 AM, Yang Zhang yanghates...@gmail.com wrote:
 On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yang Zhang yanghates...@gmail.com writes:
 It currently takes up to 24h for us to run a large set of UPDATE
 statements on a database, which are of the form:

 UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
 id = constid

 (We're just overwriting fields of objects identified by ID.)

 Forgive the obvious question, but you do have an index on id, right?
 Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?

 Totally valid question.  That is the primary key with its own index.
 Yes, we verified that explain says it just use a simple index scan.
 Each individual query runs reasonably quickly (we can run several
 dozen such statements per second).


 The tables have handfuls of indices each and no foreign key constraints.

 How much is a handful?

 The table with the largest volume of updates (our bottleneck) has four 
 indexes:

 account_pkey PRIMARY KEY, btree (id)
 account_createddate btree (createddate)
 account_id_prefix btree (id text_pattern_ops)
 account_recordtypeid btree (recordtypeid)


 It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
 baseline we should reasonably target.

 Well, maybe.  You didn't say what percentage of the DB you're updating.

 It can be 10-50% of rows changed - a large portion.


 But the thing that comes to mind here is that you're probably incurring
 a network round trip for each row, and maybe a query-planning round as
 well, so you really can't expect that this is going to be anywhere near
 as efficient as a bulk load operation.  You could presumably get rid of
 the planner overhead by using a prepared statement.  Cutting the network
 overhead is going to require a bit more ingenuity --- could you move
 some logic into a stored procedure, perhaps, so that one command from
 the client is sufficient to update multiple rows?

 You're right, we're only sequentially issuing (unprepared) UPDATEs.

 If we ship many UPDATE statements per call to our DB API's execution
 function (we're using Python's psycopg2 if that matters, but I think
 that just binds libpq), would that avoid the network round trip per
 statement?

 If not, what if we use anonymous procedures (DO) to run multiple
 UPDATE statements?

 Finally, we could use the technique highlighted in my third bullet and
 use COPY (or at least multiple-value INSERT), then merging the new
 data with the old.  Would that be the most direct route to maximum
 performance?

 In any case, I assume deleting and rebuilding indexes is important
 here, yes?  But what about raising checkpoint_segments - does this
 actually help sustained throughput?

(I ask because I'm wondering if raising checkpoint_segments simply
postpones inevitable work, or if collecting a larger amount of changes
really does dramatically improve throughput somehow.)



 regards, tom lane


 --
 Yang Zhang
 http://yz.mit.edu/



--
Yang Zhang
http://yz.mit.edu/




Re: [GENERAL] Optimizing bulk update performance

2013-04-27 Thread Misa Simic
Hi,

If the dataset for the update is large...

Maybe the best approach would be:

From the client machine, instead of sending UPDATE statements with the data,
export the data to a file ready for the COPY command.
Transfer the file to the server where pg is running.
Make a plpgsql function which:

- creates a temp table
- COPYs into the temp table from the file
- updates the original table with the values in the temp table:

UPDATE foo
SET col1 = bar.col1
FROM bar
WHERE foo.id = bar.id;

You don't need to do delete/insert if you have just UPDATE commands.

From the client, when the file is transferred, call your import function on
the server.

Optionally you can run VACUUM ANALYZE after the bulk operation...
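
A minimal sketch of such a function (the table name, column names, and the
way the file path is passed are just placeholders; the path must be readable
by the PostgreSQL server process, since this is a server-side COPY):

CREATE OR REPLACE FUNCTION bulk_update_account(p_path text) RETURNS void AS $$
BEGIN
    -- staging table with the same structure as the target, gone at commit
    CREATE TEMP TABLE tmp_account (LIKE account) ON COMMIT DROP;

    -- load the new values in one bulk operation
    EXECUTE format('COPY tmp_account FROM %L', p_path);

    -- one set-based update instead of one statement per row
    UPDATE account
       SET col1 = t.col1,
           col2 = t.col2
      FROM tmp_account AS t
     WHERE account.id = t.id;
END;
$$ LANGUAGE plpgsql;

Then from the client, once the file has been transferred, a single call does
the whole thing:

SELECT bulk_update_account('/path/on/server/updates.copy');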

Kind regards,

Misa



On Saturday, April 27, 2013, Yang Zhang wrote:

 On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane t...@sss.pgh.pa.us
 wrote:
  Yang Zhang yanghates...@gmail.com writes:
  It currently takes up to 24h for us to run a large set of UPDATE
  statements on a database, which are of the form:
 
  UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
  id = constid
 
  (We're just overwriting fields of objects identified by ID.)
 
  Forgive the obvious question, but you do have an index on id, right?
  Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?

 Totally valid question.  That is the primary key with its own index.
 Yes, we verified that explain says it just use a simple index scan.
 Each individual query runs reasonably quickly (we can run several
 dozen such statements per second).

 
  The tables have handfuls of indices each and no foreign key constraints.
 
  How much is a handful?

 The table with the largest volume of updates (our bottleneck) has four
 indexes:

 account_pkey PRIMARY KEY, btree (id)
 account_createddate btree (createddate)
 account_id_prefix btree (id text_pattern_ops)
 account_recordtypeid btree (recordtypeid)

 
  It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
  baseline we should reasonably target.
 
  Well, maybe.  You didn't say what percentage of the DB you're updating.

 It can be 10-50% of rows changed - a large portion.

 
  But the thing that comes to mind here is that you're probably incurring
  a network round trip for each row, and maybe a query-planning round as
  well, so you really can't expect that this is going to be anywhere near
  as efficient as a bulk load operation.  You could presumably get rid of
  the planner overhead by using a prepared statement.  Cutting the network
  overhead is going to require a bit more ingenuity --- could you move
  some logic into a stored procedure, perhaps, so that one command from
  the client is sufficient to update multiple rows?

 You're right, we're only sequentially issuing (unprepared) UPDATEs.

 If we ship many UPDATE statements per call to our DB API's execution
 function (we're using Python's psycopg2 if that matters, but I think
 that just binds libpq), would that avoid the network round trip per
 statement?

 If not, what if we use anonymous procedures (DO) to run multiple
 UPDATE statements?

 Finally, we could use the technique highlighted in my third bullet and
 use COPY (or at least multiple-value INSERT), then merging the new
 data with the old.  Would that be the most direct route to maximum
 performance?

 In any case, I assume deleting and rebuilding indexes is important
 here, yes?  But what about raising checkpoint_segments - does this
 actually help sustained throughput?

 
  regards, tom lane


 --
 Yang Zhang
 http://yz.mit.edu/





Re: [GENERAL] Optimizing bulk update performance

2013-04-27 Thread Yang Zhang
On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic misa.si...@gmail.com wrote:
 Hi,

 If dataset for update is large...

 Maybe best would be:

 From client machine, instead of sending update statements with data - export
 data to file ready for copy command
 Transfer file to the server where pg is running
 Make pgsql function which

 Create temp table
 Copy to temp from the file

 Update original table with values in temp

 UPDATE foo
 SET foo.col1 = bar.col1
 FROM bar
 WHERE foo.id = bar.id

 You dont need to do delete/insert - if you have just update comands

 From client when file is transfered - call your import function on the the
 server

 Optionaly you can run vacuum analyze after bulk operation...

But wouldn't a bulk UPDATE touch many existing pages (say, 20%
scattered around) to mark rows as dead (per MVCC)?  I guess it comes
down to: will PG be smart enough to mark dead rows in largely
sequential scans (rather than, say, jumping around in whatever order
rows from foo are yielded by the above join)?

In other words, when considering the alternative of:

CREATE TABLE newfoo AS
SELECT * FROM bar
UNION
SELECT * FROM foo
WHERE id NOT IN (SELECT id FROM bar);

Wouldn't this alternative be faster?


 Kind regards,

 Misa



 On Saturday, April 27, 2013, Yang Zhang wrote:

 On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Yang Zhang yanghates...@gmail.com writes:
  It currently takes up to 24h for us to run a large set of UPDATE
  statements on a database, which are of the form:
 
  UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
  id = constid
 
  (We're just overwriting fields of objects identified by ID.)
 
  Forgive the obvious question, but you do have an index on id, right?
  Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?

 Totally valid question.  That is the primary key with its own index.
 Yes, we verified that explain says it just use a simple index scan.
 Each individual query runs reasonably quickly (we can run several
 dozen such statements per second).

 
  The tables have handfuls of indices each and no foreign key
  constraints.
 
  How much is a handful?

 The table with the largest volume of updates (our bottleneck) has four
 indexes:

 account_pkey PRIMARY KEY, btree (id)
 account_createddate btree (createddate)
 account_id_prefix btree (id text_pattern_ops)
 account_recordtypeid btree (recordtypeid)

 
  It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
  baseline we should reasonably target.
 
  Well, maybe.  You didn't say what percentage of the DB you're updating.

 It can be 10-50% of rows changed - a large portion.

 
  But the thing that comes to mind here is that you're probably incurring
  a network round trip for each row, and maybe a query-planning round as
  well, so you really can't expect that this is going to be anywhere near
  as efficient as a bulk load operation.  You could presumably get rid of
  the planner overhead by using a prepared statement.  Cutting the network
  overhead is going to require a bit more ingenuity --- could you move
  some logic into a stored procedure, perhaps, so that one command from
  the client is sufficient to update multiple rows?

 You're right, we're only sequentially issuing (unprepared) UPDATEs.

 If we ship many UPDATE statements per call to our DB API's execution
 function (we're using Python's psycopg2 if that matters, but I think
 that just binds libpq), would that avoid the network round trip per
 statement?

 If not, what if we use anonymous procedures (DO) to run multiple
 UPDATE statements?

 Finally, we could use the technique highlighted in my third bullet and
 use COPY (or at least multiple-value INSERT), then merging the new
 data with the old.  Would that be the most direct route to maximum
 performance?

 In any case, I assume deleting and rebuilding indexes is important
 here, yes?  But what about raising checkpoint_segments - does this
 actually help sustained throughput?

 
  regards, tom lane


 --
 Yang Zhang
 http://yz.mit.edu/





--
Yang Zhang
http://yz.mit.edu/




Re: [GENERAL] Optimizing bulk update performance

2013-04-27 Thread Yang Zhang
On Sat, Apr 27, 2013 at 2:54 AM, Yang Zhang yanghates...@gmail.com wrote:
 On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic misa.si...@gmail.com wrote:
 Hi,

 If dataset for update is large...

 Maybe best would be:

 From client machine, instead of sending update statements with data - export
 data to file ready for copy command
 Transfer file to the server where pg is running
 Make pgsql function which

 Create temp table
 Copy to temp from the file

 Update original table with values in temp

 UPDATE foo
 SET foo.col1 = bar.col1
 FROM bar
 WHERE foo.id = bar.id

 You dont need to do delete/insert - if you have just update comands

 From client when file is transfered - call your import function on the the
 server

 Optionaly you can run vacuum analyze after bulk operation...

 But wouldn't a bulk UPDATE touch many existing pages (say, 20%
 scattered around) to mark rows as dead (per MVCC)?  I guess it comes
 down to: will PG be smart enough to mark dead rows in largely
 sequential scans (rather than, say, jumping around in whatever order
 rows from foo are yielded by the above join)?

(This raises the question: how might I see this seemingly substantial
performance implication, one way or the other, in, say, EXPLAIN output or
something like that?)
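
Something like the following is what I'd try, though I'm not sure how much
the plan alone reveals about the physical ordering of the updates (BUFFERS at
least shows how many pages the statement touches; column names are
placeholders):

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE foo
   SET col1 = bar.col1
  FROM bar
 WHERE foo.id = bar.id;
ROLLBACK;  -- EXPLAIN ANALYZE really performs the update, so roll it back here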


 In other words, when considering the alternative of:

 CREATE TABLE newfoo AS
 SELECT * FROM bar
 UNION
 SELECT * FROM foo
 WHERE id NOT IN (SELECT id FROM bar);

 Wouldn't this alternative be faster?


 Kind regards,

 Misa



 On Saturday, April 27, 2013, Yang Zhang wrote:

 On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Yang Zhang yanghates...@gmail.com writes:
  It currently takes up to 24h for us to run a large set of UPDATE
  statements on a database, which are of the form:
 
  UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
  id = constid
 
  (We're just overwriting fields of objects identified by ID.)
 
  Forgive the obvious question, but you do have an index on id, right?
  Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?

 Totally valid question.  That is the primary key with its own index.
 Yes, we verified that explain says it just use a simple index scan.
 Each individual query runs reasonably quickly (we can run several
 dozen such statements per second).

 
  The tables have handfuls of indices each and no foreign key
  constraints.
 
  How much is a handful?

 The table with the largest volume of updates (our bottleneck) has four
 indexes:

 account_pkey PRIMARY KEY, btree (id)
 account_createddate btree (createddate)
 account_id_prefix btree (id text_pattern_ops)
 account_recordtypeid btree (recordtypeid)

 
  It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
  baseline we should reasonably target.
 
  Well, maybe.  You didn't say what percentage of the DB you're updating.

 It can be 10-50% of rows changed - a large portion.

 
  But the thing that comes to mind here is that you're probably incurring
  a network round trip for each row, and maybe a query-planning round as
  well, so you really can't expect that this is going to be anywhere near
  as efficient as a bulk load operation.  You could presumably get rid of
  the planner overhead by using a prepared statement.  Cutting the network
  overhead is going to require a bit more ingenuity --- could you move
  some logic into a stored procedure, perhaps, so that one command from
  the client is sufficient to update multiple rows?

 You're right, we're only sequentially issuing (unprepared) UPDATEs.

 If we ship many UPDATE statements per call to our DB API's execution
 function (we're using Python's psycopg2 if that matters, but I think
 that just binds libpq), would that avoid the network round trip per
 statement?

 If not, what if we use anonymous procedures (DO) to run multiple
 UPDATE statements?

 Finally, we could use the technique highlighted in my third bullet and
 use COPY (or at least multiple-value INSERT), then merging the new
 data with the old.  Would that be the most direct route to maximum
 performance?

 In any case, I assume deleting and rebuilding indexes is important
 here, yes?  But what about raising checkpoint_segments - does this
 actually help sustained throughput?

 
  regards, tom lane


 --
 Yang Zhang
 http://yz.mit.edu/





 --
 Yang Zhang
 http://yz.mit.edu/



--
Yang Zhang
http://yz.mit.edu/




Re: [GENERAL] Optimizing bulk update performance

2013-04-27 Thread Misa Simic
I don't know - you can test :)

In the whole solution only one command differs - so it's easy to test and
compare...

To me it doesn't sound faster... It sounds like more operations need to be
done...

And it produces more problems... e.g. what about table foo?  What if another
table references foo, etc.?

On Saturday, April 27, 2013, Yang Zhang wrote:

 On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic 
 misa.si...@gmail.com
 wrote:
  Hi,
 
  If dataset for update is large...
 
  Maybe best would be:
 
  From client machine, instead of sending update statements with data -
 export
  data to file ready for copy command
  Transfer file to the server where pg is running
  Make pgsql function which
 
  Create temp table
  Copy to temp from the file
 
  Update original table with values in temp
 
  UPDATE foo
  SET foo.col1 = bar.col1
  FROM bar
  WHERE foo.id = bar.id
 
  You dont need to do delete/insert - if you have just update comands
 
  From client when file is transfered - call your import function on the
 the
  server
 
  Optionaly you can run vacuum analyze after bulk operation...

 But wouldn't a bulk UPDATE touch many existing pages (say, 20%
 scattered around) to mark rows as dead (per MVCC)?  I guess it comes
 down to: will PG be smart enough to mark dead rows in largely
 sequential scans (rather than, say, jumping around in whatever order
 rows from foo are yielded by the above join)?

 In other words, when considering the alternative of:

 CREATE TABLE newfoo AS
 SELECT * FROM bar
 UNION
 SELECT * FROM foo
 WHERE id NOT IN (SELECT id FROM bar);

 Wouldn't this alternative be faster?

 
  Kind regards,
 
  Misa
 
 
 
  On Saturday, April 27, 2013, Yang Zhang wrote:
 
  On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
   Yang Zhang yanghates...@gmail.com writes:
   It currently takes up to 24h for us to run a large set of UPDATE
   statements on a database, which are of the form:
  
   UPDATE table SET field1 = constant1, field2 = constant2, ...
  WHERE
   id = constid
  
   (We're just overwriting fields of objects identified by ID.)
  
   Forgive the obvious question, but you do have an index on id, right?
   Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?
 
  Totally valid question.  That is the primary key with its own index.
  Yes, we verified that explain says it just use a simple index scan.
  Each individual query runs reasonably quickly (we can run several
  dozen such statements per second).
 
  
   The tables have handfuls of indices each and no foreign key
   constraints.
  
   How much is a handful?
 
  The table with the largest volume of updates (our bottleneck) has four
  indexes:
 
  account_pkey PRIMARY KEY, btree (id)
  account_createddate btree (createddate)
  account_id_prefix btree (id text_pattern_ops)
  account_recordtypeid btree (recordtypeid)
 
  
   It takes 2h to import a `pg_dump` of the entire DB.  This seems like
 a
   baseline we should reasonably target.
  
   Well, maybe.  You didn't say what percentage of the DB you're
 updating.
 
  It can be 10-50% of rows changed - a large portion.
 
  
   But the thing that comes to mind here is that you're probably
 incurring
   a network round trip for each row, and maybe a query-planning round as
   well, so you really can't expect that this is going to be anywhere
 near
   as efficient as a bulk load operation.  You could presumably get rid
 of
   the planner overhead by using a prepared statement.  Cutting the
 network
   overhead is going to require a bit more ingenuity --- could you move
   some logic into a stored procedure, perhaps, so that one command from
   the client is sufficient to update multiple rows?
 
  You're right, we're only sequentially issuing (unprepared) UPDATEs.
 
  If we ship many UPDATE statements per call to our DB API's execution
  function (we're using Python's psycopg2 if that matters, but I think
  that just binds libpq), would that avoid the network round trip per
  statement?
 
  If not, what if we use anonymous procedures (DO) to run multiple
  UPDATE statements?
 
  Finally, we could use the technique highlighted in my third bullet and
  use COPY (or at least multiple-value INSERT), then merging the new
  data with the old.  Would that be the most direct route to maximum
  performance?
 
  In any case, I assume deleting and rebuilding indexes is important
  here, yes?  But what about raising checkpoint_segments - does this
  actually help sustained throughput?
 
  
   regards, tom lane
 
 
  --
  Yang Zhang
  http://yz.mit.edu/
 
 





Re: [GENERAL] Optimizing bulk update performance

2013-04-27 Thread Yang Zhang
On Sat, Apr 27, 2013 at 3:06 AM, Misa Simic misa.si...@gmail.com wrote:
 I dont know - u can test :)

I probably will, but I have a huge stack of such experiments to run by
now, and it's always tricky and takes care to get benchmarks right, avoid
disk caches, etc.  Certainly I think it would be helpful (or at least
hopefully not harmful) to ask here to see if anyone might just know.
That's what brought me to this list.  :)


 In whole solution it is just one command different - so easy to test and
 compare...

 To me it doesnt sound as faster... Sounds as more operation needed what
 should be done...

 And produce more problems...i.e what with table foo? What if another table
 refference foo etc...

Yep, I guess more specifically I was just thinking of dumping to a temp table:

CREATE TEMP TABLE tmp AS
SELECT * FROM foo;

TRUNCATE foo;

INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);

The remaining question I have is whether the bulk UPDATE will be able
to update many rows efficiently (smartly ordering them to do largely
sequential scans) - if so, I imagine it would be faster than the
above.


 On Saturday, April 27, 2013, Yang Zhang wrote:

 On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic misa.si...@gmail.com wrote:
  Hi,
 
  If dataset for update is large...
 
  Maybe best would be:
 
  From client machine, instead of sending update statements with data -
  export
  data to file ready for copy command
  Transfer file to the server where pg is running
  Make pgsql function which
 
  Create temp table
  Copy to temp from the file
 
  Update original table with values in temp
 
  UPDATE foo
  SET foo.col1 = bar.col1
  FROM bar
  WHERE foo.id = bar.id
 
  You dont need to do delete/insert - if you have just update comands
 
  From client when file is transfered - call your import function on the
  the
  server
 
  Optionaly you can run vacuum analyze after bulk operation...

 But wouldn't a bulk UPDATE touch many existing pages (say, 20%
 scattered around) to mark rows as dead (per MVCC)?  I guess it comes
 down to: will PG be smart enough to mark dead rows in largely
 sequential scans (rather than, say, jumping around in whatever order
 rows from foo are yielded by the above join)?

 In other words, when considering the alternative of:

 CREATE TABLE newfoo AS
 SELECT * FROM bar
 UNION
 SELECT * FROM foo
 WHERE id NOT IN (SELECT id FROM bar);

 Wouldn't this alternative be faster?

 
  Kind regards,
 
  Misa
 
 
 
  On Saturday, April 27, 2013, Yang Zhang wrote:
 
  On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
   Yang Zhang yanghates...@gmail.com writes:
   It currently takes up to 24h for us to run a large set of UPDATE
   statements on a database, which are of the form:
  
   UPDATE table SET field1 = constant1, field2 = constant2, ...
   WHERE
   id = constid
  
   (We're just overwriting fields of objects identified by ID.)
  
   Forgive the obvious question, but you do have an index on id,
   right?
   Have you checked it's being used (ie EXPLAIN ANALYZE on one of
   these)?
 
  Totally valid question.  That is the primary key with its own index.
  Yes, we verified that explain says it just use a simple index scan.
  Each individual query runs reasonably quickly (we can run several
  dozen such statements per second).
 
  
   The tables have handfuls of indices each and no foreign key
   constraints.
  
   How much is a handful?
 
  The table with the largest volume of updates (our bottleneck) has four
  indexes:
 
  account_pkey PRIMARY KEY, btree (id)
  account_createddate btree (createddate)
  account_id_prefix btree (id text_pattern_ops)
  account_recordtypeid btree (recordtypeid)
 
  
   It takes 2h to import a `pg_dump` of the entire DB.  This seems like
   a
   baseline we should reasonably target.
  
   Well, maybe.  You didn't say what percentage of the DB you're
   updating.
 
  It can be 10-50% of rows changed - a large portion.
 
  
   But the thing that comes to mind here is that you're probably
   incurring
   a network round trip for each row, and maybe a query-planning round
   as
   well, so you really can't expect that this is going to be anywhere
   near
   as efficient as a bulk load operation.  You could presumably get rid
   of
   the planner overhead by using a prepared statement.  Cutting the
   network
   overhead is going to require a bit more ingenuity --- could you move
   some logic into a stored procedure, perhaps, so that one command from
   the client is sufficient to update multiple rows?
 
  You're right, we're only sequentially issuing (unprepared) UPDATEs.
 
  If we ship many UPDATE statements per call to our DB API's execution
  function (we're using Python's psycopg2 if that matters, but I think
  that just binds libpq), would that avoid the network round trip per
  statement?
 
  If not, what if we use anonymous procedures (DO) to run multiple
  UPDATE statements?
 
  

Re: [GENERAL] Optimizing bulk update performance

2013-04-27 Thread Misa Simic
Well,

With large datasets there is rarely an approach that is always the best...

You will always find there are a few ways - only testing confirms which one
is best.  It depends on many things like hardware, OS, etc... Sometimes it
even depends on the dataset for the update...


CREATE TEMP TABLE tmp AS
SELECT * FROM foo;

TRUNCATE foo;

INSERT INTO foo
SELECT * FROM bar
UNION
SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);


The above doesn't make sense to me...

I would do:

CREATE TEMP TABLE tmp AS
SELECT * FROM foo where 1=2;

COPY tmp FROM 'pathtofile';

UPDATE foo
SET col1 = tmp.col1,
    ...
    col15 = tmp.col15
FROM tmp
WHERE foo.id = tmp.id;

That's in case I know I only need updates... If my dataset has a mix of
updates and potentially new rows, then instead of the UPDATE command I would
do:

DELETE FROM foo WHERE EXISTS (SELECT 1 FROM tmp WHERE tmp.id = foo.id);
INSERT INTO foo
SELECT * FROM tmp;




On Saturday, April 27, 2013, Yang Zhang wrote:

 On Sat, Apr 27, 2013 at 3:06 AM, Misa Simic 
 misa.si...@gmail.com
 wrote:
  I dont know - u can test :)

 I probably will, but I do have a huge stack of such experiments to run
 by now, and it's always tricky / takes care to get benchmarks right,
 avoid disk caches, etc.  Certainly I think it would be helpful (or at
 least hopefully not harmful) to ask here to see if anyone might just
 know.  That's what brought me to this list.  :)

 
  In whole solution it is just one command different - so easy to test and
  compare...
 
  To me it doesnt sound as faster... Sounds as more operation needed what
  should be done...
 
  And produce more problems...i.e what with table foo? What if another
 table
  refference foo etc...

 Yep, I guess more specifically I was just thinking of dumping to a temp
 table:

 CREATE TEMP TABLE tmp AS
 SELECT * FROM foo;

 TRUNCATE foo;

 INSERT INTO foo
 SELECT * FROM bar
 UNION
 SELECT * FROM tmp WHERE id NOT IN (SELECT id FROM bar);

 The question I have remaining is whether the bulk UPDATE will be able
 to update many rows efficiently (smartly order them to do largely
 sequential scans) - if so, I imagine it would be faster than the
 above.

 
  On Saturday, April 27, 2013, Yang Zhang wrote:
 
  On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic misa.si...@gmail.com
 wrote:
   Hi,
  
   If dataset for update is large...
  
   Maybe best would be:
  
   From client machine, instead of sending update statements with data -
   export
   data to file ready for copy command
   Transfer file to the server where pg is running
   Make pgsql function which
  
   Create temp table
   Copy to temp from the file
  
   Update original table with values in temp
  
   UPDATE foo
   SET foo.col1 = bar.col1
   FROM bar
   WHERE foo.id = bar.id
  
   You dont need to do delete/insert - if you have just update
 comands
  
   From client when file is transfered - call your import function on the
   the
   server
  
   Optionaly you can run vacuum analyze after bulk operation...
 
  But wouldn't a bulk UPDATE touch many existing pages (say, 20%
  scattered around) to mark rows as dead (per MVCC)?  I guess it comes
  down to: will PG be smart enough to mark dead rows in largely
  sequential scans (rather than, say, jumping around in whatever order
  rows from foo are yielded by the above join)?
 
  In other words, when considering the alternative of:
 
  CREATE TABLE newfoo AS
  SELECT * FROM bar
  UNION
  SELECT * FROM foo
  WHERE id NOT IN (SELECT id FROM bar);
 
  Wouldn't this alternative be faster?
 
  
   Kind regards,
  
   Misa
  
  
  
   On Saturday, April 27, 2013, Yang Zhang wrote:
  
   On Fri, Apr 26, 2013 at 9:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
Yang Zhang yanghates...@gmail.com writes:
It currently takes up to 24h for us to run a large set of UPDATE
statements on a database, which are of the form:
   
UPDATE table SET field1 = constant1, field2 = constant2, ...
WHERE
id = constid
   
(We're just overwriting fields of objects identified by ID.)
   
Forgive the obvious question, but you do have an index on id,
right?
Have you checked it's being used (ie EXPLAIN ANALYZE on one of
these)?
  
   Totally valid question.  That is the primary key with its own index.
   Yes, we verified that explain says it just use a simple index scan.
   Each individual query runs reasonably quickly (we can run several
   dozen such statements per second).
  
   
The tables have handfuls of indices each and no foreign key
constraints.
   
How much is a handful?
  
   The table with the largest volume of updates (our bottleneck) has
 four
   indexes:
  
   account_pkey PRIMARY KEY, btree (id)
   account_createddate btree (createddate)
   account_id_prefix btree (id text_pattern_ops)
   account_recordtypeid btree (recordtypeid)
  
   
It takes 2h to import a `pg_dump` of the entire DB.  This seems
 like
a
 


Re: [GENERAL] Optimizing bulk update performance

2013-04-27 Thread Tom Lane
Yang Zhang yanghates...@gmail.com writes:
 You're right, we're only sequentially issuing (unprepared) UPDATEs.

You definitely want to fix both parts of that, then.

 If we ship many UPDATE statements per call to our DB API's execution
 function (we're using Python's psycopg2 if that matters, but I think
 that just binds libpq), would that avoid the network round trip per
 statement?

Possibly, not sure how psycopg2 handles that.

 If not, what if we use anonymous procedures (DO) to run multiple
 UPDATE statements?

I don't think an anonymous procedure as such would result in any
plan caching, at least not unless you could write it to have a single
UPDATE in a loop.
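
For instance, something along these lines (hypothetical table/column names;
the VALUES list just stands in for however the data gets into the block),
where the block contains exactly one UPDATE whose plan plpgsql can cache
across iterations:

DO $$
DECLARE
    r record;
BEGIN
    FOR r IN
        SELECT * FROM (VALUES ('id1', 'v1'),
                              ('id2', 'v2')) AS v(id, field1)
    LOOP
        UPDATE account SET field1 = r.field1 WHERE id = r.id;
    END LOOP;
END;
$$;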

 Finally, we could use the technique highlighted in my third bullet and
 use COPY (or at least multiple-value INSERT), then merging the new
 data with the old.  Would that be the most direct route to maximum
 performance?

It might help, you'd need to try it.

 In any case, I assume deleting and rebuilding indexes is important
 here, yes?  But what about raising checkpoint_segments - does this
 actually help sustained throughput?

If you're updating as much as 50% of the table, and you don't need the
indexes for other purposes meanwhile, dropping and rebuilding them would
be worth trying.

Also, you definitely want checkpoint_segments large enough so that
checkpoints are at least a few minutes apart.  Excess checkpoints do
represent a sustained drag on performance because they mean a greater
volume of disk writes.
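
A sketch of what that might look like (the numbers are only illustrative
starting points, not recommendations from this thread), plus one way to watch
how often checkpoints actually occur:

-- postgresql.conf (9.1-era settings):
--   checkpoint_segments = 64            -- default is 3
--   checkpoint_completion_target = 0.9  -- spread checkpoint I/O out
--   log_checkpoints = on                -- log when and why each checkpoint ran

SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;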

regards, tom lane




Re: [GENERAL] Optimizing bulk update performance

2013-04-27 Thread Jasen Betts
On 2013-04-27, Yang Zhang yanghates...@gmail.com wrote:
 On Sat, Apr 27, 2013 at 1:55 AM, Misa Simic misa.si...@gmail.com wrote:

 Optionaly you can run vacuum analyze after bulk operation...

 But wouldn't a bulk UPDATE touch many existing pages (say, 20%
 scattered around) to mark rows as dead (per MVCC)?  I guess it comes
 down to: will PG be smart enough to mark dead rows in largely
 sequential scans (rather than, say, jumping around in whatever order
 rows from foo are yielded by the above join)?

A plpgsql FOR-IN-query loop isn't going to be that smart; it's a
procedural language and does things procedurally.  If you want to do
set operations, use SQL.

this:

 UPDATE existing-table SET ... FROM temp-table WHERE join-condition;
 
will likely get you a sequential scan over the existing table 

and should be reasonably performant as long as temp-table is small
enough to fit in memory.
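
Filled in with placeholder names (and analyzing the temp table first, since
autovacuum doesn't process temporary tables), that's roughly:

ANALYZE temp_account;   -- give the planner row counts for the temp table

UPDATE account
   SET field1 = t.field1,
       field2 = t.field2
  FROM temp_account AS t
 WHERE account.id = t.id;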

-- 
⚂⚃ 100% natural





Re: [GENERAL] Optimizing bulk update performance

2013-04-26 Thread Gavin Flower

On 27/04/13 12:14, Yang Zhang wrote:

It currently takes up to 24h for us to run a large set of UPDATE
statements on a database, which are of the form:

 UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
 id = constid

(We're just overwriting fields of objects identified by ID.)

The tables have handfuls of indices each and no foreign key constraints.
No COMMIT is made till the end.

It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
baseline we should reasonably target.

Short of producing a custom program that somehow reconstructs a dataset
for Postgresql to re-import, is there anything we can do to bring the
bulk UPDATE performance closer to that of the import?  (This is an area
that we believe log-structured merge trees handle well, but we're
wondering if there's anything we can do within Postgresql.)

Some ideas:

- dropping all non-ID indices and rebuilding afterward?
- increasing checkpoint_segments, but does this actually help sustained
   long-term throughput?
- using the techniques mentioned here?  (Load new data as table, then
   merge in old data where ID is not found in new data)
   
http://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9...@mail.gmail.com

Basically there's a bunch of things to try and we're not sure what the
most effective are or if we're overlooking other things.  We'll be
spending the next few days experimenting, but we thought we'd ask here
as well.

Thanks.


People will need to know your version of Postgres and which operating 
system etc., plus details of CPU, RAM, and disks... as well as what 
changes you have made to postgresql.conf...


I would be inclined to DROP all indexes and reCREATE them later.

Updating a row might lead to a new row being added in a new disk page, so 
I suspect that updates will hit every index associated with the table, 
with the possible exception of partial indexes.
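
For example, using the index definitions quoted elsewhere in the thread (one
reasonable variant is to keep the primary key so any WHERE id = ... lookups
still have an index, and drop/rebuild only the secondary ones):

DROP INDEX account_createddate;
DROP INDEX account_id_prefix;
DROP INDEX account_recordtypeid;

-- ... run the bulk update here ...

CREATE INDEX account_createddate  ON account (createddate);
CREATE INDEX account_id_prefix    ON account (id text_pattern_ops);
CREATE INDEX account_recordtypeid ON account (recordtypeid);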


Running too many updates in one transaction may mean that Postgres may 
need to use disk work files.


Depending on RAM etc., it may pay to increase some variables that affect 
how Postgres uses RAM; some of these are per session.



Cheers,
Gavin


Re: [GENERAL] Optimizing bulk update performance

2013-04-26 Thread Yang Zhang
We're using Postgresql 9.1.9 on Ubuntu 12.04 on EBS volumes on
m1.xlarge instances, which have:

15 GiB memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
64-bit platform

(Yes, we're moving to EBS Optimized instances + Provisioned IOPS
volumes, but preliminary benchmarks suggest this won't give us as much
of a boost as refactoring the way we're executing these bulk updates in
our application.)

On Fri, Apr 26, 2013 at 5:27 PM, Gavin Flower
gavinflo...@archidevsys.co.nz wrote:
 On 27/04/13 12:14, Yang Zhang wrote:

 It currently takes up to 24h for us to run a large set of UPDATE
 statements on a database, which are of the form:

 UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
 id = constid

 (We're just overwriting fields of objects identified by ID.)

 The tables have handfuls of indices each and no foreign key constraints.
 No COMMIT is made till the end.

 It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
 baseline we should reasonably target.

 Short of producing a custom program that somehow reconstructs a dataset
 for Postgresql to re-import, is there anything we can do to bring the
 bulk UPDATE performance closer to that of the import?  (This is an area
 that we believe log-structured merge trees handle well, but we're
 wondering if there's anything we can do within Postgresql.)

 Some ideas:

 - dropping all non-ID indices and rebuilding afterward?
 - increasing checkpoint_segments, but does this actually help sustained
   long-term throughput?
 - using the techniques mentioned here?  (Load new data as table, then
   merge in old data where ID is not found in new data)

 http://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9...@mail.gmail.com

 Basically there's a bunch of things to try and we're not sure what the
 most effective are or if we're overlooking other things.  We'll be
 spending the next few days experimenting, but we thought we'd ask here
 as well.

 Thanks.


 People will need to know your version of Postgres  which Operating System
 etc. plus details of CPU RAM, and Disks... AS well as what changes you have
 made to postgresql.conf...

 I would be inclined to DROP all indexes and reCREATE them later.

 Updating a row might lead to new row being added in a new disk page, so I
 suspect that updates will hit every index associated with the table with the
 (possible exception of partial indexes).

 Running too many updates in one transaction, may mean that Postgres may need
 to use disk work files.

 Depending on RAM etc, it may pay to increase some variables tat affect how
 Postgres uses RAM, some of these are per session.


 Cheers,
 Gavin



--
Yang Zhang
http://yz.mit.edu/




Re: [GENERAL] Optimizing bulk update performance

2013-04-26 Thread Gavin Flower
Please do not top post; the convention on these lists is to add stuff at 
the end, apart from comments interspersed to make use of appropriate 
context!


On 27/04/13 13:35, Yang Zhang wrote:

We're using Postgresql 9.1.9 on Ubuntu 12.04 on EBS volumes on
m1.xlarge instances, which have:

15 GiB memory
8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
64-bit platform

(Yes, we're moving to EBS Optimized instances + Provisioned IOPS
volumes, but prelim. benchmarks suggest this won't get us enough of a
boost as much as possibly refactoring the way we're executing these
bulk updates in our application.)

On Fri, Apr 26, 2013 at 5:27 PM, Gavin Flower
gavinflo...@archidevsys.co.nz wrote:

On 27/04/13 12:14, Yang Zhang wrote:

It currently takes up to 24h for us to run a large set of UPDATE
statements on a database, which are of the form:

 UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
 id = constid

(We're just overwriting fields of objects identified by ID.)

The tables have handfuls of indices each and no foreign key constraints.
No COMMIT is made till the end.

It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
baseline we should reasonably target.

Short of producing a custom program that somehow reconstructs a dataset
for Postgresql to re-import, is there anything we can do to bring the
bulk UPDATE performance closer to that of the import?  (This is an area
that we believe log-structured merge trees handle well, but we're
wondering if there's anything we can do within Postgresql.)

Some ideas:

- dropping all non-ID indices and rebuilding afterward?
- increasing checkpoint_segments, but does this actually help sustained
   long-term throughput?
- using the techniques mentioned here?  (Load new data as table, then
   merge in old data where ID is not found in new data)

http://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9...@mail.gmail.com

Basically there's a bunch of things to try and we're not sure what the
most effective are or if we're overlooking other things.  We'll be
spending the next few days experimenting, but we thought we'd ask here
as well.

Thanks.


People will need to know your version of Postgres  which Operating System
etc. plus details of CPU RAM, and Disks... AS well as what changes you have
made to postgresql.conf...

I would be inclined to DROP all indexes and reCREATE them later.

Updating a row might lead to new row being added in a new disk page, so I
suspect that updates will hit every index associated with the table with the
(possible exception of partial indexes).

Running too many updates in one transaction, may mean that Postgres may need
to use disk work files.

Depending on RAM etc, it may pay to increase some variables tat affect how
Postgres uses RAM, some of these are per session.


Cheers,
Gavin



--
Yang Zhang
http://yz.mit.edu/






Re: [GENERAL] Optimizing bulk update performance

2013-04-26 Thread Yang Zhang
On Fri, Apr 26, 2013 at 7:01 PM, Gavin Flower
gavinflo...@archidevsys.co.nz wrote:
 Please do not top post, the convention in these list are to add stuff at the
 end, apart from comments interspersed to make use of appropriate context!

Noted, thanks.

Anyway, any performance hints are greatly appreciated.



 On 27/04/13 13:35, Yang Zhang wrote:

 We're using Postgresql 9.1.9 on Ubuntu 12.04 on EBS volumes on
 m1.xlarge instances, which have:

 15 GiB memory
 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each)
 64-bit platform

 (Yes, we're moving to EBS Optimized instances + Provisioned IOPS
 volumes, but prelim. benchmarks suggest this won't get us enough of a
 boost as much as possibly refactoring the way we're executing these
 bulk updates in our application.)

 On Fri, Apr 26, 2013 at 5:27 PM, Gavin Flower
 gavinflo...@archidevsys.co.nz wrote:

 On 27/04/13 12:14, Yang Zhang wrote:

 It currently takes up to 24h for us to run a large set of UPDATE
 statements on a database, which are of the form:

  UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
  id = constid

 (We're just overwriting fields of objects identified by ID.)

 The tables have handfuls of indices each and no foreign key constraints.
 No COMMIT is made till the end.

 It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
 baseline we should reasonably target.

 Short of producing a custom program that somehow reconstructs a dataset
 for Postgresql to re-import, is there anything we can do to bring the
 bulk UPDATE performance closer to that of the import?  (This is an area
 that we believe log-structured merge trees handle well, but we're
 wondering if there's anything we can do within Postgresql.)

 Some ideas:

 - dropping all non-ID indices and rebuilding afterward?
 - increasing checkpoint_segments, but does this actually help sustained
long-term throughput?
 - using the techniques mentioned here?  (Load new data as table, then
merge in old data where ID is not found in new data)


 http://www.postgresql.org/message-id/3a0028490809301807j59498370m1442d8f5867e9...@mail.gmail.com

 Basically there's a bunch of things to try and we're not sure what the
 most effective are or if we're overlooking other things.  We'll be
 spending the next few days experimenting, but we thought we'd ask here
 as well.

 Thanks.


 People will need to know your version of Postgres  which Operating
 System
 etc. plus details of CPU RAM, and Disks... AS well as what changes you
 have
 made to postgresql.conf...

 I would be inclined to DROP all indexes and reCREATE them later.

 Updating a row might lead to new row being added in a new disk page, so I
 suspect that updates will hit every index associated with the table with
 the
 (possible exception of partial indexes).

 Running too many updates in one transaction, may mean that Postgres may
 need
 to use disk work files.

 Depending on RAM etc, it may pay to increase some variables tat affect
 how
 Postgres uses RAM, some of these are per session.


 Cheers,
 Gavin



 --
 Yang Zhang
 http://yz.mit.edu/





--
Yang Zhang
http://yz.mit.edu/




Re: [GENERAL] Optimizing bulk update performance

2013-04-26 Thread Tom Lane
Yang Zhang yanghates...@gmail.com writes:
 It currently takes up to 24h for us to run a large set of UPDATE
 statements on a database, which are of the form:

 UPDATE table SET field1 = constant1, field2 = constant2, ...  WHERE
 id = constid

 (We're just overwriting fields of objects identified by ID.)

Forgive the obvious question, but you do have an index on id, right?
Have you checked it's being used (ie EXPLAIN ANALYZE on one of these)?
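
(For instance, with placeholder names and values - you want to see an index
scan on the primary key rather than a sequential scan; note that EXPLAIN
ANALYZE does execute the statement:)

EXPLAIN ANALYZE
UPDATE account SET field1 = 'constant1' WHERE id = 'constid';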

 The tables have handfuls of indices each and no foreign key constraints.

How much is a handful?

 It takes 2h to import a `pg_dump` of the entire DB.  This seems like a
 baseline we should reasonably target.

Well, maybe.  You didn't say what percentage of the DB you're updating.

But the thing that comes to mind here is that you're probably incurring
a network round trip for each row, and maybe a query-planning round as
well, so you really can't expect that this is going to be anywhere near
as efficient as a bulk load operation.  You could presumably get rid of
the planner overhead by using a prepared statement.  Cutting the network
overhead is going to require a bit more ingenuity --- could you move
some logic into a stored procedure, perhaps, so that one command from
the client is sufficient to update multiple rows?

regards, tom lane

