Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?
On Sun, Mar 13, 2011 at 18:36, runner run...@winning.com wrote:
> Tried removing the indexes and other constraints just for the import,
> but for a noob like me this was too much to ask. Maybe when I get
> more experience.

pgAdmin should make it pretty easy. Choose each index and constraint, save the code from the SQL pane for when you need to restore it, and do a right click -> Drop.

> Other than being very inefficient, and consuming more time than
> necessary, is there any other downside to importing into an indexed
> table?

Doing so will result in somewhat larger (more bloated) indexes, but generally the performance impact of this is minimal.

Regards,
Marti

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
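The save-then-restore step can also be done in plain SQL; a minimal sketch, assuming a hypothetical table `mytable` with an index `mytable_foo_idx`:

```sql
-- Save the definition first (pg_indexes.indexdef holds the full
-- CREATE INDEX statement) so it can be restored after the load.
SELECT indexdef
FROM pg_indexes
WHERE tablename = 'mytable';

DROP INDEX mytable_foo_idx;

-- ... run the bulk import here ...

-- Recreate the index from the saved definition, e.g.:
CREATE INDEX mytable_foo_idx ON mytable (foo);
```

Constraints (primary keys, foreign keys) need ALTER TABLE ... DROP CONSTRAINT / ADD CONSTRAINT rather than DROP INDEX.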
Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?
On Mon, Mar 14, 2011 at 4:17 AM, Marti Raudsepp ma...@juffo.org wrote:
> On Sun, Mar 13, 2011 at 18:36, runner run...@winning.com wrote:
>> Other than being very inefficient, and consuming more time than
>> necessary, is there any other downside to importing into an indexed
>> table?
>
> Doing so will result in somewhat larger (more bloated) indexes, but
> generally the performance impact of this is minimal.

I've done bulk data imports of this size with minimal pain by simply breaking the raw data into chunks (10M records becomes 10 files of 1M records) on a separate spindle from the database, and performing multiple COPY commands, but no more than one COPY per server core. I tested this a while back on a 4-core server, and when I attempted 5 COPYs at a time, the time to complete went up almost 30%. I don't recall any benefit from having fewer than 4 in this case, but the server was only processing my data at the time. Indexes were on the target table; however, I dropped all constraints. The UNIX split command is handy for breaking the data up into individual files.

Greg
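The chunking step can be sketched with split(1); the paths, row counts, and the mydb/mytable names below are made-up examples, and the parallel COPY loop is left commented out because it assumes a running server:

```shell
# Stand-in for the real export (hypothetical path and sizes).
mkdir -p /tmp/bulkload
cd /tmp/bulkload
seq 1 100000 > dump.csv

# 25000 lines per chunk -> 4 chunks, one per core on a 4-core server.
split -l 25000 dump.csv chunk_
ls chunk_*    # chunk_aa chunk_ab chunk_ac chunk_ad

# Then run at most one COPY per core, e.g.:
# for f in chunk_*; do
#     psql -d mydb -c "\copy mytable FROM '$f' CSV" &
# done
# wait
```
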
Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?
> Don't insert data into an indexed table. A very important point with
> bulk-loading is that you should load all the data first, then create
> the indexes. Running multiple (different) CREATE INDEX queries in
> parallel can additionally save a lot of time. Also don't move data
> back and forth between the tables, just drop the original when
> you're done.

I just saw your post and it looks similar to what I'm doing. We're going to be loading 12G of data from a MySQL dump into our pg 9.0.3 database next weekend. I've been testing this for the last two weeks. Tried removing the indexes and other constraints just for the import, but for a noob like me this was too much to ask. Maybe when I get more experience. So I *WILL* be importing all of my data into indexed tables.

I timed it, and it will take eight hours. I'm sure I could get it down to two or three hours for the import if I really knew more about Postgres, but that's the price you pay when you slam-dunk a project and your staff isn't familiar with the database back end.

Other than being very inefficient, and consuming more time than necessary, is there any other downside to importing into an indexed table? In the four test imports I've done, everything seems to work fine; it just takes a long time.

Sorry for hijacking your thread here!
Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?
On Fri, Mar 11, 2011 at 21:06, fork forkandw...@gmail.com wrote:
> Like the following? Will it rebuild the indexes in a sensical way?

Don't insert data into an indexed table. A very important point with bulk-loading is that you should load all the data first, then create the indexes. Running multiple (different) CREATE INDEX queries in parallel can additionally save a lot of time. Also don't move data back and forth between the tables; just drop the original when you're done.

Doing this should give a significant performance win. Partitioning the tables to fit in cache should improve it further, but I'm not sure anymore that it's worthwhile, considering the costs and extra maintenance.

> Is there a rule of thumb on tradeoffs in a partitioned table?

The only certain thing is that you'll lose "group aggregate" and "merge join" query plans. If you only see HashAggregate plans when you EXPLAIN your GROUP BY queries, then it probably won't make much of a difference.

> I would use the partition column whatever I am most likely to cluster
> by in a single big table, right?

Yes.

Regards,
Marti
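The load-first, index-later order can be sketched as follows; the table and column names are hypothetical, and the CREATE INDEX statements are meant to be issued from separate sessions so they build in parallel:

```sql
-- 1. Build the new table with no indexes or constraints in place.
CREATE TABLE newtab AS
SELECT *, foo || ' ' || bar AS newcol
FROM oldtab;

-- 2. Create the indexes afterwards. Run each statement in its own
--    session/connection to build them in parallel.
CREATE INDEX newtab_foo_idx ON newtab (foo);
CREATE INDEX newtab_newcol_idx ON newtab (newcol);

-- 3. Don't copy the data back into the original table;
--    just drop it and rename.
DROP TABLE oldtab;
ALTER TABLE newtab RENAME TO oldtab;
```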
Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?
Marti Raudsepp marti at juffo.org writes:

> If you don't mind long recovery times in case of a crash, set
> checkpoint_segments to ~100 and checkpoint_completion_target=0.9;
> this will improve write throughput significantly.

Sounds good.

> Also, if you don't mind CORRUPTing your database after a crash,
> setting fsync=off and full_page_writes=off gives another significant
> boost.

I probably won't do this... ;)

> UPDATE on a table with many indexes will probably be slower. If you
> want to speed up this part, use INSERT INTO x SELECT and take this
> chance to partition your table,

Like the following? Will it rebuild the indexes in a sensical way?

BEGIN;
CREATE TABLE tempfoo AS SELECT *, foo + bar AS newcol FROM bar;
TRUNCATE foo;
ALTER TABLE foo ADD COLUMN newcol numeric;
INSERT INTO foo SELECT * FROM tempfoo;
DROP TABLE tempfoo;
END;

> such that each individual partition and most indexes will fit in your
> cache.

Is there a rule of thumb on tradeoffs in a partitioned table? About half the time, I will want to do GROUP BYs that use the partition column, but about half the time I won't. (I would use the partition column whatever I am most likely to cluster by in a single big table, right?)

For example, I might intuitively partition by age5 (into 20 tables like tab00, tab05, tab10, etc). Often a query would be SELECT ... FROM PARENTTABLE GROUP BY age5, race, etc, but often it would be GROUP BY state or whatever with no age5 component.

I know I can experiment ;), but it takes a while to load anything, and I would rather stand on the shoulders. Thanks so much for all your help!
[PERFORM] Tuning massive UPDATES and GROUP BY's?
Given that doing a massive UPDATE SET foo = bar || ' ' || baz; on a 12 million row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS) is never going to be that fast, what should one do to make it faster?

I set work_mem to 2048MB, but it currently is only using a little bit of memory and CPU. (3% and 15% according to top; on a SELECT DISTINCT ... LIMIT earlier, it was using 70% of the memory.)

The data is not particularly sensitive; if something happened and it rolled back, that wouldn't be the end of the world. So I don't know if I can use dangerous settings for WAL, checkpoints, etc. There also aren't a lot of concurrent hits on the DB, though a few.

I am loath to create a new table from a select, since the indexes themselves take a really long time to build.

As the title alludes, I will also be doing GROUP BYs on the data, and would love to speed these up, mostly just for my own impatience...
Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?
On Thu, Mar 10, 2011 at 9:40 AM, fork forkandw...@gmail.com wrote:
> Given that doing a massive UPDATE SET foo = bar || ' ' || baz; on a
> 12 million row table (with about 100 columns -- the US Census PUMS
> for the 2005-2009 ACS) is never going to be that fast, what should
> one do to make it faster?
>
> I set work_mem to 2048MB, but it currently is only using a little bit
> of memory and CPU. (3% and 15% according to top; on a SELECT DISTINCT
> ... LIMIT earlier, it was using 70% of the memory.)
>
> The data is not particularly sensitive; if something happened and it
> rolled back, that wouldn't be the end of the world. So I don't know
> if I can use dangerous settings for WAL, checkpoints, etc. There also
> aren't a lot of concurrent hits on the DB, though a few.
>
> I am loath to create a new table from a select, since the indexes
> themselves take a really long time to build.

You are aware that updating the field for the entire table, especially if there is an index on it (or any field being updated), will cause all your indexes to be rebuilt anyway? When you update a record, it gets a new position in the table, and a new index entry with that position. insert/select to temp + truncate + insert/select back is usually going to be faster and will save you the reindex/cluster. OTOH, if you have foreign keys it can be a headache.

> As the title alludes, I will also be doing GROUP BYs on the data, and
> would love to speed these up, mostly just for my own impatience...

Need to see the query here to see if you can make them go faster.

merlin
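The insert/select round trip described above can be sketched as follows, with made-up table and column names, and assuming no foreign keys reference foo_table:

```sql
BEGIN;

-- Stage the rewritten rows in an unindexed temp table.
CREATE TEMP TABLE staging AS
SELECT id, bar || ' ' || baz AS foo, bar, baz
FROM foo_table;

-- Empty the original cheaply (TRUNCATE rather than DELETE).
TRUNCATE foo_table;

-- Reload: index entries are built fresh as the rows come back in,
-- instead of bloating the indexes with dead entries via UPDATE.
INSERT INTO foo_table (id, foo, bar, baz)
SELECT id, foo, bar, baz FROM staging;

COMMIT;
```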
Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?
Merlin Moncure mmoncure at gmail.com writes:

>> I am loath to create a new table from a select, since the indexes
>> themselves take a really long time to build.
>
> You are aware that updating the field for the entire table, especially
> if there is an index on it (or any field being updated), will cause
> all your indexes to be rebuilt anyway? When you update a record, it
> gets a new position in the table, and a new index entry with that
> position. insert/select to temp + truncate + insert/select back is
> usually going to be faster and will save you the reindex/cluster.
> OTOH, if you have foreign keys it can be a headache.

Hmph. I guess I will have to find a way to automate it, since there will be a lot of times I want to do this.

>> As the title alludes, I will also be doing GROUP BYs on the data, and
>> would love to speed these up, mostly just for my own impatience...
>
> Need to see the query here to see if you can make them go faster.

I guess I was hoping for a blog entry on general guidelines given a DB that is really only for batch analysis versus transaction processing. Like put all your temp tables on a different disk, or whatever. I will post specifics later.
Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?
On Thu, Mar 10, 2011 at 17:40, fork forkandw...@gmail.com wrote:
> The data is not particularly sensitive; if something happened and it
> rolled back, that wouldn't be the end of the world. So I don't know
> if I can use dangerous settings for WAL, checkpoints, etc. There also
> aren't a lot of concurrent hits on the DB, though a few.

If you don't mind long recovery times in case of a crash, set checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this will improve write throughput significantly.

Also, if you don't mind CORRUPTing your database after a crash, setting fsync=off and full_page_writes=off gives another significant boost.

> I am loath to create a new table from a select, since the indexes
> themselves take a really long time to build.

UPDATE on a table with many indexes will probably be slower. If you want to speed up this part, use INSERT INTO x SELECT and take this chance to partition your table, such that each individual partition and most indexes will fit in your cache.

Index builds from a warm cache are very fast in PostgreSQL. You can create several indexes at once in separate sessions, and the table will only be scanned once. Don't forget to bump up maintenance_work_mem for index builds; 256MB might be a reasonable arbitrary value.

The downside is that partitioning can interfere with your read queries if they expect the data in a sorted order. But then, HashAggregate tends to be faster than GroupAggregate in many cases, so this might not matter for your queries. Alternatively, you can experiment with PostgreSQL 9.1 alpha, which has mostly fixed this shortcoming with the "merge append" plan node.

> As the title alludes, I will also be doing GROUP BYs on the data, and
> would love to speed these up, mostly just for my own impatience...

I think regular tuning is the best you can do here.
Regards,
Marti
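The non-destructive settings from this thread, collected into a postgresql.conf fragment (values are the ones suggested above, not universal recommendations; the corruption-risking pair is left commented out):

```ini
# Bulk-load tuning from this thread (PostgreSQL 9.0-era parameter names).
checkpoint_segments = 100            # fewer, larger checkpoints
checkpoint_completion_target = 0.9   # spread checkpoint I/O out
maintenance_work_mem = 256MB         # faster CREATE INDEX

# DANGEROUS: a crash can CORRUPT the database. Throwaway loads only.
# fsync = off
# full_page_writes = off
```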