Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-14 Thread Marti Raudsepp
On Sun, Mar 13, 2011 at 18:36, runner <run...@winning.com> wrote:
 Tried removing the indexes and other constraints just for
 the import but for a noob like me, this was too much to ask.  Maybe
 when I get more experience.

pgAdmin should make it pretty easy. Choose each index and constraint,
save the code from the SQL pane for when you need to restore it, and
do a right click -> Drop.
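
In plain SQL, the same save-and-restore cycle might look like the sketch below. The table, index, and constraint names are illustrative, not from the thread:

```sql
-- Save the definitions first so they can be re-created after the load:
SELECT indexdef FROM pg_indexes WHERE tablename = 'mytable';
SELECT conname, pg_get_constraintdef(oid) FROM pg_constraint
WHERE conrelid = 'mytable'::regclass;

-- Drop for the bulk import:
ALTER TABLE mytable DROP CONSTRAINT mytable_pkey;
DROP INDEX mytable_col_idx;

-- ... run the import ...

-- Restore from the saved definitions:
ALTER TABLE mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);
CREATE INDEX mytable_col_idx ON mytable (col);
```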

 Other than being very inefficient and consuming more time than
 necessary, is there any other downside to importing into an indexed
 table?

Doing so will result in somewhat larger (more bloated) indexes, but
generally the performance impact of this is minimal.

Regards,
Marti

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-14 Thread Greg Spiegelberg
On Mon, Mar 14, 2011 at 4:17 AM, Marti Raudsepp <ma...@juffo.org> wrote:

 On Sun, Mar 13, 2011 at 18:36, runner <run...@winning.com> wrote:
  Other than being very inefficient and consuming more time than
  necessary, is there any other downside to importing into an indexed
  table?

 Doing so will result in somewhat larger (more bloated) indexes, but
 generally the performance impact of this is minimal.


Bulk data imports of this size I've done with minimal pain by simply
breaking the raw data into chunks (10M records becomes 10 files of 1M
records each), kept on a separate spindle from the database, and performing
multiple COPY commands, but no more than one COPY per server core.  I tested
this a while back on a 4-core server, and when I attempted 5 COPYs at a time
the time to complete went up almost 30%.  I don't recall any benefit to
having fewer than 4 in this case, but the server was only processing my data
at the time.  Indexes were on the target table; however, I dropped all
constraints.  The UNIX split command is handy for breaking the data up into
individual files.
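
A rough sketch of that workflow; the file names, table name, and chunk size are invented for illustration:

```shell
# Split the raw dump into 1M-record files on a spindle separate
# from the database's data directory:
split -l 1000000 /mnt/staging/rawdata.csv /mnt/staging/chunk_

# Feed the chunks to psql, at most 4 COPYs in flight (one per core):
ls /mnt/staging/chunk_* | xargs -n 1 -P 4 -I {} \
  psql -d mydb -c "\copy mytable FROM '{}' CSV"
```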

Greg


Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-13 Thread runner
Don't insert data into an indexed table. A very important point with
bulk-loading is that you should load all the data first, then create
the indexes. Running multiple (different) CREATE INDEX queries in
parallel can additionally save a lot of time. Also don't move data
back and forth between the tables, just drop the original when you're
done.

I just saw your post and it looks similar to what I'm doing.
We're going to be loading 12G of data from a MySQL dump into our 
pg 9.0.3 database next weekend.  I've been testing this for the last
two weeks.  Tried removing the indexes and other constraints just for
the import but for a noob like me, this was too much to ask.  Maybe
when I get more experience.  So I *WILL* be importing all of my data
into indexed tables.  I timed it and it will take eight hours.  

I'm sure I could get it down to two or three hours for the import
if I really knew more about postgres but that's the price you pay when
you slam dunk a project and your staff isn't familiar with the 
database back-end.  Other than being very inefficient and consuming
more time than necessary, is there any other downside to importing
into an indexed table?  In the four test imports I've done,
everything seems to work fine, just takes a long time.

Sorry for hijacking your thread here!



Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-12 Thread Marti Raudsepp
On Fri, Mar 11, 2011 at 21:06, fork <forkandw...@gmail.com> wrote:
 Like the following?  Will it rebuild the indexes in a sensical way?

Don't insert data into an indexed table. A very important point with
bulk-loading is that you should load all the data first, then create
the indexes. Running multiple (different) CREATE INDEX queries in
parallel can additionally save a lot of time. Also don't move data
back and forth between the tables, just drop the original when you're
done.
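
As a hedged sketch of that order of operations (table and index names are invented):

```sql
-- 1. Load into a table with no indexes in place:
COPY big_table FROM '/data/dump.csv' CSV;

-- 2. Then build the indexes. Running each CREATE INDEX in its own
--    session at the same time lets the builds overlap:
CREATE INDEX big_table_a_idx ON big_table (a);   -- session 1
CREATE INDEX big_table_b_idx ON big_table (b);   -- session 2

-- 3. Don't copy the data back; just drop the original:
DROP TABLE old_table;
```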

Doing this should give a significant performance win. Partitioning
them to fit in cache should improve it further, but I'm not sure
anymore that it's worthwhile considering the costs and extra
maintenance.

 Is there a rule of thumb on tradeoffs in a partitioned table?

The only certain thing is that you'll lose group aggregate and
merge join query plans. If you only see HashAggregate plans when
you EXPLAIN your GROUP BY queries then it probably won't make much of
a difference.
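
One quick way to check, using a hypothetical query shape:

```sql
-- If the plan shows HashAggregate rather than GroupAggregate,
-- losing sorted plans to partitioning probably won't matter much:
EXPLAIN SELECT state, count(*) FROM pums GROUP BY state;
```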

 I would use as the partition column whatever I am most likely
 to cluster by in a single big table, right?

Yes.

Regards,
Marti



Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-11 Thread fork
Marti Raudsepp marti at juffo.org writes:

 If you don't mind long recovery times in case of a crash, set
 checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this
 will improve write throughput significantly.

Sounds good.

 Also, if you don't mind CORRUPTing your database after a crash,
 setting fsync=off and full_page_writes=off gives another significant
 boost.

I probably won't do this... ;)

 UPDATE on a table with many indexes will probably be slower. If you
 want to speed up this part, use INSERT INTO x SELECT and take this
 chance to partition your table, 

Like the following?  Will it rebuild the indexes in a sensical way?

BEGIN;
CREATE TABLE tempfoo AS SELECT *, foo + bar AS newcol FROM foo;
TRUNCATE foo;
ALTER TABLE foo ADD COLUMN newcol numeric;  -- column type is illustrative
INSERT INTO foo SELECT * FROM tempfoo;
DROP TABLE tempfoo;
END;

 such that each individual partition
 and most indexes will fit in your cache. 

Is there a rule of thumb on tradeoffs in a partitioned table?  About half the
time, I will want to do GROUP BY's that use the partition column, but about half
the time I won't.  (I would use as the partition column whatever I am most
likely to cluster by in a single big table, right?)

For example, I might intuitively partition by age5 (into 20 tables like tab00,
tab05, tab10, etc). Often a query would be SELECT ... FROM PARENTTABLE GROUP BY
age5, race, etc, but often it would be GROUP BY state or whatever with no
age5 component.

I know I can experiment ;), but it takes a while to load anything, and I
would rather stand on the shoulders of giants.

Thanks so much for all your help!






[PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-10 Thread fork
Given that doing a massive UPDATE SET foo = bar || ' ' || baz; on a 12 million
row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS)
is never going to be that fast, what should one do to make it faster?

I set work_mem to 2048MB, but it currently is only using a little bit of memory
and CPU. (3% and 15% according to top; on a SELECT DISTINCT ... LIMIT earlier,
it was using 70% of the memory).

The data is not particularly sensitive; if something happened and it rolled
back, that wouldn't be the end of the world.  So I don't know if I can use
dangerous settings for WAL checkpoints etc.  There also aren't a lot of
concurrent hits on the DB, though a few.

I am loathe to create a new table from a select, since the indexes themselves
take a really long time to build.

As the title alludes, I will also be doing GROUP BY's on the data, and would
love to speed these up, mostly just for my own impatience...  






Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-10 Thread Merlin Moncure
On Thu, Mar 10, 2011 at 9:40 AM, fork <forkandw...@gmail.com> wrote:
 Given that doing a massive UPDATE SET foo = bar || ' ' || baz; on a 12 million
 row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS)
 is never going to be that fast, what should one do to make it faster?

 I set work_mem to 2048MB, but it currently is only using a little bit of
 memory and CPU. (3% and 15% according to top; on a SELECT DISTINCT ...
 LIMIT earlier, it was using 70% of the memory).

 The data is not particularly sensitive; if something happened and it rolled
 back, that wouldn't be the end of the world.  So I don't know if I can use
 dangerous settings for WAL checkpoints etc.  There also aren't a lot of
 concurrent hits on the DB, though a few.

 I am loathe to create a new table from a select, since the indexes themselves
 take a really long time to build.

you are aware that updating the field for the entire table, especially
if there is an index on it (or any field being updated), will cause
all your indexes to be rebuilt anyways?  when you update a record, it
gets a new position in the table, and a new index entry with that
position.  insert/select to temp, + truncate + insert/select back is
usually going to be faster and will save you the reindex/cluster.
otoh, if you have foreign keys it can be a headache.

 As the title alludes, I will also be doing GROUP BY's on the data, and would
 love to speed these up, mostly just for my own impatience...

need to see the query here to see if you can make them go faster.

merlin



Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-10 Thread fork
Merlin Moncure mmoncure at gmail.com writes:

  I am loathe to create a new table from a select, since the indexes 
  themselves
  take a really long time to build.
 
 you are aware that updating the field for the entire table, especially
 if there is an index on it (or any field being updated), will cause
 all your indexes to be rebuilt anyways?  when you update a record, it
 gets a new position in the table, and a new index entry with that
 position.  
 insert/select to temp, + truncate + insert/select back is
 usually going to be faster and will save you the reindex/cluster.
 otoh, if you have foreign keys it can be a headache.

Hmph.  I guess I will have to find a way to automate it, since there will be a
lot of times I want to do this. 

  As the title alludes, I will also be doing GROUP BY's on the data, and would
  love to speed these up, mostly just for my own impatience...
 
 need to see the query here to see if you can make them go faster.

I guess I was hoping for a blog entry on general guidelines given a DB that is
really only for batch analysis versus transaction processing.  Like put all
your temp tables on a different disk or whatever.  I will post specifics later.




Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-10 Thread Marti Raudsepp
On Thu, Mar 10, 2011 at 17:40, fork <forkandw...@gmail.com> wrote:
 The data is not particularly sensitive; if something happened and it rolled
 back, that wouldn't be the end of the world.  So I don't know if I can use
 dangerous settings for WAL checkpoints etc.  There also aren't a lot of
 concurrent hits on the DB, though a few.

If you don't mind long recovery times in case of a crash, set
checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this
will improve write throughput significantly.
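
In postgresql.conf terms, the settings discussed in this thread would look roughly like this (the commented-out lines are the dangerous ones from the next paragraph):

```ini
# Bulk-load throughput settings from this thread:
checkpoint_segments = 100             # fewer, larger checkpoints
checkpoint_completion_target = 0.9    # spread checkpoint writes out

# Only if database CORRUPTION after a crash is acceptable:
# fsync = off
# full_page_writes = off
```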

Also, if you don't mind CORRUPTing your database after a crash,
setting fsync=off and full_page_writes=off gives another significant
boost.

 I am loathe to create a new table from a select, since the indexes themselves
 take a really long time to build.

UPDATE on a table with many indexes will probably be slower. If you
want to speed up this part, use INSERT INTO x SELECT and take this
chance to partition your table, such that each individual partition
and most indexes will fit in your cache. Index builds from a warm
cache are very fast in PostgreSQL. You can create several indexes at
once in separate sessions, and the table will only be scanned once.
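
On 9.0 that means inheritance-based partitioning; a minimal sketch, with all table and column names invented for illustration:

```sql
-- Inheritance-based partitioning (PostgreSQL 9.0 era):
CREATE TABLE pums (age5 int, state text);  -- parent; real table has ~100 columns
CREATE TABLE pums_00 (CHECK (age5 >= 0 AND age5 < 5))  INHERITS (pums);
CREATE TABLE pums_05 (CHECK (age5 >= 5 AND age5 < 10)) INHERITS (pums);

-- Load each partition directly, then index each partition separately.
-- With constraint_exclusion = partition, queries filtering on age5
-- skip the irrelevant partitions.
```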

Don't forget to bump up maintenance_work_mem for index builds, 256MB
might be a reasonable arbitrary value.
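
For example, at the session level (the index name is illustrative):

```sql
SET maintenance_work_mem = '256MB';
CREATE INDEX pums_state_idx ON pums (state);
```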

The downside is that partitioning can interfere with your read queries
if they expect the data in a sorted order. But then, HashAggregate
tends to be faster than GroupAggregate in many cases, so this might
not matter for your queries. Alternatively you can experiment with
PostgreSQL 9.1 alpha, which has mostly fixed this shortcoming with the
merge append plan node.

 As the title alludes, I will also be doing GROUP BY's on the data, and would
 love to speed these up, mostly just for my own impatience...

I think regular tuning is the best you can do here.

Regards,
Marti
