Re: [PERFORM] insert waits for delete with trigger
Thank you. How about:

select c.relname, l.pid, l.mode, l.granted, a.current_query
from pg_locks l, pg_class c, pg_stat_activity a
where l.relation = c.oid AND l.pid = a.procpid
order by l.granted, l.pid;

 relname | pid | mode | granted | current_query
---------+-----+------+---------+--------------
 q_20040810 | 488 | AccessShareLock | t | IDLE
 q_20040810 | 488 | RowExclusiveLock | t | IDLE
 q_process | 3729 | AccessShareLock | t | DELETE FROM q_20040805 WHERE domain_id ='2005761066' AND module='spam'
 q_process | 3729 | RowExclusiveLock | t | DELETE FROM q_20040805 WHERE domain_id ='2005761066' AND module='spam'
 q_20040805 | 3729 | AccessShareLock | t | DELETE FROM q_20040805 WHERE domain_id ='2005761066' AND module='spam'
 q_20040805 | 3729 | RowExclusiveLock | t | DELETE FROM q_20040805 WHERE domain_id ='2005761066' AND module='spam'
 q_summary | 3729 | AccessShareLock | t | DELETE FROM q_20040805 WHERE domain_id ='2005761066' AND module='spam'
 q_summary | 3729 | RowExclusiveLock | t | DELETE FROM q_20040805 WHERE domain_id ='2005761066' AND module='spam'
 q_summary_did_dir_idx | 3729 | AccessShareLock | t | DELETE FROM q_20040805 WHERE domain_id ='2005761066' AND module='spam'
 pg_shadow | 7660 | AccessShareLock | t | IDLE
 pg_locks | 7660 | AccessShareLock | t | IDLE
 pg_database | 7660 | AccessShareLock | t | IDLE
 pg_class | 7660 | AccessShareLock | t | IDLE
 pg_stat_activity | 7660 | AccessShareLock | t | IDLE
 pg_class_oid_index | 7660 | AccessShareLock | t | IDLE
 q_process | 8593 | AccessShareLock | t | DELETE FROM q_20040810 WHERE domain_id ='2002300623' AND module='spam'
 q_process | 8593 | RowExclusiveLock | t | DELETE FROM q_20040810 WHERE domain_id ='2002300623' AND module='spam'
 q_20040810 | 8593 | AccessShareLock | t | DELETE FROM q_20040810 WHERE domain_id ='2002300623' AND module='spam'
 q_20040810 | 8593 | RowExclusiveLock | t | DELETE FROM q_20040810 WHERE domain_id ='2002300623' AND module='spam'
 q_summary | 8593 | AccessShareLock | t | DELETE FROM q_20040810 WHERE domain_id ='2002300623' AND module='spam'
 q_summary | 8593 | RowExclusiveLock | t | DELETE FROM q_20040810 WHERE domain_id ='2002300623' AND module='spam'
 q_summary_did_dir_idx | 8593 | AccessShareLock | t | DELETE FROM q_20040810 WHERE domain_id ='2002300623' AND module='spam'
 q_process | 19027 | AccessShareLock | t | INSERT INTO q_process (...) SELECT ... FROM q_20040805 WHERE domain_id='2005761066' AND module='spam'
 q_process | 19027 | RowExclusiveLock | t | INSERT INTO q_process (...) SELECT ... FROM q_20040805 WHERE domain_id='2005761066' AND module='spam'
 q_20040805 | 19027 | AccessShareLock | t | INSERT INTO q_process (...) SELECT ... FROM q_20040805 WHERE domain_id='2005761066' AND module='spam'
 q_did_mod_dir_20040805_idx | 19027 | AccessShareLock | t | INSERT INTO q_process (...) SELECT ... FROM q_20040805 WHERE domain_id='2005761066' AND module='spam'
(26 rows)

ps -elfww | grep 19027
040 S postgres 19027 870 1 69 0 - 81290 semtim 07:31 ? 00:00:51 postgres: postgres mxl 192.168.0.177:38266 INSERT waiting

--- Tom Lane [EMAIL PROTECTED] wrote:
> Litao Wu [EMAIL PROTECTED] writes:
>> Did I miss something?
>
> Your join omits all transaction locks.
>
> regards, tom lane
Re: [PERFORM] The black art of postgresql.conf tweaking
hi,

Paul Serby wrote:
> Can anyone give a good reference site/book for getting the most out of your Postgres server? All I can find is contradicting theories on how to work out your settings. This is what I followed to set up our db server that serves our web applications:
> http://www.phpbuilder.com/columns/smith20010821.php3?page=2
>
> We have a Dell Poweredge with the following spec:
> CPU: Intel(R) Xeon(TM) CPU 3.06GHz (512 KB Cache)
> CPU: Intel(R) Xeon(TM) CPU 3.06GHz (512 KB Cache)
> CPU: Intel(R) Xeon(TM) CPU 3.06GHz (512 KB Cache)
> CPU: Intel(R) Xeon(TM) CPU 3.06GHz (512 KB Cache)
> Physical Memory: 2077264 kB
> Swap Memory: 2048244 kB
>
> Apache on the web server can take up to 300 connections and PHP is using pg_pconnect.
>
> Postgres is set with the following:
> max_connections = 300
> shared_buffers = 38400
> sort_mem = 12000
>
> But Apache is still maxing out the non-superuser connection limit. The machine is under no load and I would like to up max_connections, but I would like to know more about what you need to consider before doing so.

One more: in php.ini, set pgsql.max_persistent lower than 300.

; Maximum number of persistent links. -1 means no limit.
pgsql.max_persistent = -1   <-- change this

C.
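A sketch of what that php.ini change might look like (the value is purely illustrative; the idea is to keep persistent links per Apache child well under max_connections):

; cap persistent PostgreSQL links per PHP/Apache process (illustrative value)
pgsql.max_persistent = 2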
[PERFORM] Slow select, insert, update
Having trouble with one table (see time to count records below!). Fairly new to Postgres, so any help much appreciated.

It only contains 9,106 records - as you can see from:

select count(id) from project
count
9106
1 row(s)
Total runtime: 45,778.813 ms

There are only 3 fields:
id integer nextval('id'::text)
projectnumber text
description text

There is one index:
id_project_ukey CREATE UNIQUE INDEX id_project_ukey ON project USING btree (id)

... the database is regularly vacuumed.
Re: [PERFORM] [HACKERS] fsync vs open_sync
> Anyway, with fsync enabled using standard fsync(), I get roughly 300-400 inserts per second. With fsync disabled, I get about 7000 inserts per second. When I re-enable fsync but use the open_sync option, I can get about 2500 inserts per second.
>
> You are getting 300-400 inserts/sec with fsync on? If you don't mind me asking, what's your hardware? (Also, have you checked the fsync-on numbers with the new bgwriter in 7.5?)

300 inserts per second with fsync on using fdatasync; 2500 inserts per second with fsync on using open_sync.

[EMAIL PROTECTED] mwoodward]$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 2.40GHz
stepping        : 5
cpu MHz         : 2399.373
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips        : 4784.12

Linux node1 2.4.25 #1 Mon Mar 22 13:33:41 EST 2004 i686 i686 i386 GNU/Linux

ide2: BM-DMA at 0xc400-0xc407, BIOS settings: hde:pio, hdf:pio
hde: Maxtor 6Y200P0, ATA DISK drive
hde: attached ide-disk driver.
hde: host protected area = 1
hde: 398297088 sectors (203928 MB) w/7936KiB Cache, CHS=24792/255/63, UDMA(100)
PDC20268: IDE controller at PCI slot 06:05.0
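For reference, the two modes being compared map roughly to these postgresql.conf settings (a minimal sketch; fdatasync is typically the default wal_sync_method on Linux builds of this era):

fsync = true
wal_sync_method = fdatasync    # ~300 inserts/sec in the test above
#wal_sync_method = open_sync   # ~2500 inserts/sec in the test above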
Re: [PERFORM] Slow select, insert, update
Paul Langard [EMAIL PROTECTED] writes:
> Having trouble with one table (see time to count records below!).
> Fairly new to postgres so any help much appreciated.
> It only contains 9,106 records - as you can see from:
>
> select count(id) from project
> count
> 9106
> 1 row(s)
> Total runtime: 45,778.813 ms
>
> ... the database is regularly vacuumed.

Hmm. You might try a VACUUM FULL and a REINDEX on the table (you don't say what version you are running--REINDEX is sometimes needed on 7.3 and below). Also, use EXPLAIN ANALYZE on your query and post the result--that's helpful diagnostic information.

-Doug
--
Let us cross over the river, and rest under the shade of the trees.
  --T. J. Jackson, 1863
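Spelled out for the table in question, the commands Doug suggests would look something like this (a sketch, not output from the original thread):

EXPLAIN ANALYZE SELECT count(id) FROM project;
VACUUM FULL project;
REINDEX TABLE project;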
Re: [PERFORM] Performance Bottleneck
> Squid also takes away the work of doing SSL (presuming you're running it on a different machine). Unfortunately it doesn't support HTTP/1.1 which means that most generated pages (those that don't set Content-length) end up forcing squid to close and then reopen the connection to the web server.

It is true that it doesn't support HTTP/1.1, but 'most generated pages'? Unless they are actually emitted progressively they should have a perfectly good Content-length header.

> I've also had some problems when Squid had a large number of connections open (several thousand); though that may have been because of my half_closed_clients setting. Squid 3 coped a lot better when I tried it (quite a few months ago now - and using FreeBSD and the special kqueue system call) but crashed under some (admittedly synthetic) conditions.

It runs out of the box with a very conservative setting for max open file descriptors - this may or may not be the cause of the problems you have seen. Certainly I ran squid with 16,000 connections back in 1999...

> You still have periods of time when the web servers are busy using their CPUs to generate HTML rather than waiting for database queries. This is especially true if you cache a lot of data somewhere on the web servers themselves (which, in my experience, reduces the database load a great deal). If you REALLY need to reduce the number of connections (because you have a large number of web servers doing a lot of computation, say) then it might still be useful.

Aha, a postgres-related topic in this thread! What you say is very true, but then given that the connection overhead is so vanishingly small, why not simply run without a persistent DB connection in this case? I would maintain that if your webservers are holding open idle DB connections for so long that it's a problem, then simply close the connections!

M
Re: [PERFORM] Slow select, insert, update
Paul,

Paul Langard wrote:
> Having trouble with one table (see time to count records below!).
> Fairly new to postgres so any help much appreciated.
> It only contains 9,106 records - as you can see from:
>
> select count(id) from project
> count
> 9106
> 1 row(s)
> Total runtime: 45,778.813 ms
>
> [snip]
>
> ... the database is regularly vacuumed.

Have you tried doing a VACUUM FULL, CLUSTER, or drop/restore on the table? This sounds symptomatic of a table with a bunch of dead tuples not in the FSM (free space map). Only tuples in the FSM are reclaimed by a regular VACUUM. If your FSM parameters in postgresql.conf are not big enough for your ratio of UPDATE/DELETE operations to VACUUM frequency, you will end up with dead tuples that will only be reclaimed by a VACUUM FULL.

To prevent this problem in the future, look at increasing your FSM size and possibly vacuuming more frequently or using pg_autovacuum.

Good Luck,

Bill Montgomery
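As a rough illustration of the FSM knobs Bill mentions (postgresql.conf; the numbers are placeholders to be sized against your own update/delete volume between vacuums, not recommendations):

max_fsm_pages = 100000      # total pages with free space tracked across the cluster
max_fsm_relations = 1000    # number of tables/indexes tracked in the FSM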
Re: [PERFORM] Column order performance
Josh Berkus wrote:
>> Does the order of columns of varying size have any effect on SELECT/INSERT/UPDATE and/or DELETE performance? Take the example where an integer primary key is listed first in the table and alternatively listed after some large varchar or text columns?
>
> No, the order of the columns in the table makes no difference. They are not physically stored in the metadata order, anyway; on the data pages, fixed-length fields (e.g. INT, BOOLEAN, etc.) are stored first and variable-length fields (CHAR, TEXT, NUMERIC) after them, AFAIK.

Is this true even after a table is altered to append, say, an integer column, after there are already variable-length columns in the table?

-Bill
Re: [PERFORM] Slow select, insert, update
Paul Langard [EMAIL PROTECTED] writes:
> select count(id) from project
> count
> 9106
> 1 row(s)
> Total runtime: 45,778.813 ms

Yipes. The only explanation I can think of is tremendous table bloat. What do you get from "vacuum verbose project" --- in particular, how many pages in the table?

> ... the database is regularly vacuumed.

Not regularly enough, perhaps ... or else you need to increase the free space map size parameters. In any case you'll probably need to do one round of vacuum full to get this table back within bounds.

regards, tom lane
Re: [PERFORM] Column order performance
Bill,

> Does the order of columns of varying size have any effect on SELECT/INSERT/UPDATE and/or DELETE performance? Take the example where an integer primary key is listed first in the table and alternatively listed after some large varchar or text columns?

No, the order of the columns in the table makes no difference. They are not physically stored in the metadata order, anyway; on the data pages, fixed-length fields (e.g. INT, BOOLEAN, etc.) are stored first and variable-length fields (CHAR, TEXT, NUMERIC) after them, AFAIK.

The only thing I have seen elusive reports of is that *display* speed can be affected by column order (e.g. when you call the query to the command line with many rows), but I've not seen this proven in a test case.

--
Josh Berkus
Aglio Database Solutions
San Francisco
Re: [PERFORM] [HACKERS] fsync vs open_sync
Guys, just so you know:

OSDL did some testing and found Ext3 to be perhaps the worst FS for PostgreSQL -- although this testing was with the default options. Ext3 involved an almost 40% write performance penalty compared with Ext2, whereas the penalty for ReiserFS and JFS was less than 10%.

This concurs with my personal experience.

--
Josh Berkus
Aglio Database Solutions
San Francisco
Re: [PERFORM] insert waits for delete with trigger
Litao Wu [EMAIL PROTECTED] writes:
> How about:
> select c.relname, l.pid, l.mode, l.granted, a.current_query
> from pg_locks l, pg_class c, pg_stat_activity a
> where l.relation = c.oid AND l.pid = a.procpid
> order by l.granted, l.pid;

You can't join to pg_class without eliminating the transaction lock rows (because they have NULLs in the relation field).

regards, tom lane
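One way to keep those rows would be an outer join against pg_class, so lock entries with a NULL relation still show up (a sketch based on the same columns used in the query above):

select c.relname, l.pid, l.mode, l.granted, a.current_query
from pg_locks l
  left join pg_class c on l.relation = c.oid
  join pg_stat_activity a on l.pid = a.procpid
order by l.granted, l.pid;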
Re: [PERFORM] Hardware upgrade for a high-traffic database
Hi Rod,

Actually, we're already using a substantial caching system in code for nearly all pages delivered - we've exhausted that option. Our system uses a login/session table for about 1/8 of our page views (those visitors who are logged in), and has tracking features. While I'd love to scrap them and give the database server a vacation, it's a requirement for us.

You're correct about the query caching (stored in memory) being used - most of our queries are run once and then come from memory (or, based on speed of consecutive executions, that seems to be the case). Once a user hits a page for the first time in an hour or so, it seems to cache their session query.

The issue that I think we're seeing is that the performance on the 3Ware RAID is quite bad; watching FreeBSD systat will show it at 100% busy at around 3.5 MB/s. When it needs to seek across a table (for, say, an aggregate function - typically a COUNT()), it slows the entire server down while working on the disk. Additionally, VACUUMs make the server practically useless. We have indexes on everything that's used in queries, and the planner is using them.

The server has 2GB of physical memory, however it only uses between 130MB and 200MB of it. Postgres is the only application running on the server.

Our pertinent settings look like this:

max_connections = 512
shared_buffers = 2
sort_mem = 2000
vacuum_mem = 2
effective_cache_size = 30
fsync = false
wal_sync_method = fsync
wal_buffers = 32
checkpoint_segments = 2
checkpoint_timeout = 30
commit_delay = 1

Typically, we don't use anywhere near the 512 connections - however there are peak hours where we come close, and other times that we eclipse it and run out (should some connections become serialized due to a slowdown). It's not something that we can comfortably lower. The non-standard checkpoint settings have helped make it less likely that a large (in disk time) query will conflict with a checkpoint write.

I'm a programmer - definitely not a DBA by any stretch - though I am forced into the role. From reading this list, it seems to me that our settings are reasonable given our usage, and that a disk upgrade is likely in order.

I'd love to hear any suggestions.

Thanks,

Jason

-----Original Message-----
From: Rod Taylor [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 10, 2004 7:07 PM
To: Jason Coene
Cc: Postgresql Performance
Subject: Re: [PERFORM] Hardware upgrade for a high-traffic database

> Our database is about 20GB on disk, we have some quite large tables - 2M rows with TEXT fields in a sample table, accessed constantly. We average about 4,000 - 5,000 queries per second - all from web traffic. As you can

99% is reads? and probably the same data over and over again? You might want to think about a small code change to cache sections of page output in memory for the most commonly generated pages (there are usually 3 or 4 that account for 25% to 50% of web traffic -- starting pages).

The fact you're getting 5k queries/second off IDE drives tells me most of the active data is in memory -- so your actual working data set is probably quite small (less than 10% of the 20GB).

If the above is all true (mostly reads, smallish dataset, etc.) and the database is not growing very quickly, you might want to look into RAM and RAM bandwidth over disk. An Opteron with 8GB ram using the same old IDE drives. Get a mobo with a SCSI raid controller in it, so the disk component can be upgraded in the future (when necessary).
Re: [PERFORM] Hardware upgrade for a high-traffic database
On Tue, 2004-08-10 at 13:17, Jason Coene wrote:
> Hi All,
>
> We're currently running Postgres 7.4.1 on FreeBSD 5.2, a Dual Xeon 2.4, 2GB ECC, 3Ware Serial ATA RAID 5 w/ 4 disks (SLOW!!). Our database is about 20GB on disk, we have some quite large tables - 2M rows with TEXT fields in a sample table, accessed constantly. We average about 4,000 - 5,000 queries per second - all from web traffic. As you can imagine, we're quite disk limited and checkpoints can be killer. Additionally, we see queries and connections getting serialized due to queries that take a long time (5 sec or so) while waiting on disk access. No fun at all.
>
> We've tweaked everything long and hard, and at the end of the day, the disk is killing us.
>
> We're looking to upgrade our server - or rather, replace it as it has no upgrade path to SCSI. I'm considering going Opteron (though right now we don't need more CPU time), and am looking for suggestions on what an optimal RAID configuration may look like (disks, controller, cache setting). We're in the market to buy right now - any good vendor suggestions?

I've had very good luck with LSI MegaRAID controllers with battery backed cache. The amount of cache doesn't seem as important as having it, and having it set for write back.

After that, 2 gigs or more of memory is the next improvement.

After that, the speed of the memory.
Re: [PERFORM] Hardware upgrade for a high-traffic database
> Our database is about 20GB on disk, we have some quite large tables - 2M rows with TEXT fields in a sample table, accessed constantly. We average about 4,000 - 5,000 queries per second - all from web traffic. As you can

99% is reads? and probably the same data over and over again? You might want to think about a small code change to cache sections of page output in memory for the most commonly generated pages (there are usually 3 or 4 that account for 25% to 50% of web traffic -- starting pages).

The fact you're getting 5k queries/second off IDE drives tells me most of the active data is in memory -- so your actual working data set is probably quite small (less than 10% of the 20GB).

If the above is all true (mostly reads, smallish dataset, etc.) and the database is not growing very quickly, you might want to look into RAM and RAM bandwidth over disk. An Opteron with 8GB ram using the same old IDE drives. Get a mobo with a SCSI raid controller in it, so the disk component can be upgraded in the future (when necessary).
[PERFORM] Bulk Insert and Index use
Hi,

I have a question on bulk checking, inserting into a table, and how best to use an index for performance.

The data I have to work with is a monthly CD-ROM csv data dump of 300,000 property owners from one area/shire. So every CD has 300,000-odd lines, each of which becomes a row in the 'property' table.

Beginning with the first CD, each line should require one SELECT and one INSERT, as it will be the first property with this address. The SELECT uses fields like 'street' and 'suburb' to check for an existing property, so I have built an index on those fields.

My question is: does each INSERT rebuild the index on the 'street' and 'suburb' fields? I believe it does, but I'm asking to be sure. If this is the case I guess performance will suffer when I have, say, 200,000 rows in the table. Would it be like:

a) Use index to search on 'street' and 'suburb'
b) No result? Insert new record
c) Rebuild index on 'street' and 'suburb' for each row?

Would this mean that after 200,000 rows each INSERT will require the index of 000's of rows to be re-indexed?

So far I believe my only options are to use either an index or a sequential scan and see which is faster.

A minute for your thoughts and/or suggestions would be great.

Thanks.

Regards,
Rudi.
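For concreteness, the per-row pattern described above might look like this (column names and values are illustrative guesses, since the actual DDL isn't shown; note the index is updated in place on each INSERT, not rebuilt from scratch):

CREATE INDEX property_street_suburb_idx ON property (street, suburb);

-- per CSV line: look for an existing property at this address ...
SELECT id FROM property WHERE street = '12 Example St' AND suburb = 'Sampleville';
-- ... and INSERT only if nothing came back
INSERT INTO property (street, suburb) VALUES ('12 Example St', 'Sampleville');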
Re: [PERFORM] Slow select, insert, update
Does that mean reindex is not needed for PG version 7.4? In what kinds of situations under PG 7.4 is reindex worthwhile?

Thanks,

Here is the doc from 7.3:

"PostgreSQL is unable to reuse B-tree index pages in certain cases. The problem is that if indexed rows are deleted, those index pages can only be reused by rows with similar values. For example, if indexed rows are deleted and newly inserted/updated rows have much higher values, the new rows can't use the index space made available by the deleted rows. Instead, such new rows must be placed on new index pages. In such cases, disk space used by the index will grow indefinitely, even if VACUUM is run frequently. As a solution, you can use the REINDEX command periodically to discard pages used by deleted rows. There is also contrib/reindexdb which can reindex an entire database."

The counterpart in 7.4 is:

"In some situations it is worthwhile to rebuild indexes periodically with the REINDEX command. (There is also contrib/reindexdb which can reindex an entire database.) However, PostgreSQL 7.4 has substantially reduced the need for this activity compared to earlier releases."

--- Doug McNaught [EMAIL PROTECTED] wrote:
> Paul Langard [EMAIL PROTECTED] writes:
>> Having trouble with one table (see time to count records below!).
>> Fairly new to postgres so any help much appreciated.
>> It only contains 9,106 records - as you can see from:
>>
>> select count(id) from project
>> count
>> 9106
>> 1 row(s)
>> Total runtime: 45,778.813 ms
>>
>> ... the database is regularly vacuumed.
>
> Hmm. You might try a VACUUM FULL and a REINDEX on the table (you don't say what version you are running--REINDEX is sometimes needed on 7.3 and below). Also, use EXPLAIN ANALYZE on your query and post the result--that's helpful diagnostic information.
>
> -Doug
Re: [PERFORM] Bulk Insert and Index use
If the bulk load has the possibility of duplicating data, then you need to change methods. Try bulk loading into a temp table, index it like the original, eliminate the dups and merge the tables.

It is also possible to do an insert from the temp table into the final table like:

insert into original (x,x,x)
  (select temp.1, temp.2, etc
   from temp
   left join original on temp.street = original.street
   where original.street is null)

Good Luck,
Jim

Rudi Starcevic wrote:
> Hi,
> I have a question on bulk checking, inserting into a table, and how best to use an index for performance.
> The data I have to work with is a monthly CD-ROM csv data dump of 300,000 property owners from one area/shire.
> [snip]
> My question is: does each INSERT rebuild the index on the 'street' and 'suburb' fields? I believe it does, but I'm asking to be sure. If this is the case I guess performance will suffer when I have, say, 200,000 rows in the table.
> [snip]
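A more concrete version of that staging-table merge, as a sketch (table and column names are made up for illustration, and COPY is used for the bulk load itself):

-- stage the monthly file (much faster than one INSERT per line)
CREATE TEMP TABLE property_load (street text, suburb text, owner_name text);
COPY property_load FROM '/tmp/monthly_dump.csv' WITH DELIMITER ',';

-- merge: insert only addresses not already in the main table
INSERT INTO property (street, suburb, owner_name)
SELECT l.street, l.suburb, l.owner_name
FROM property_load l
LEFT JOIN property p ON p.street = l.street AND p.suburb = l.suburb
WHERE p.street IS NULL;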
Re: [PERFORM] Bulk Insert and Index use
Usually any bulk load is faster with indexes dropped and then rebuilt ... failing that (like you really need the indexes while loading, say into a hot table), be sure to wrap all the SQL into one transaction (BEGIN; ... COMMIT;) ... if any data fails it all fails, which is usually easier to deal with than partial data loads, and it is *much* faster than having each insert be its own transaction.

HTH,

Greg Williamson
DBA
GlobeXplorer LLC

-----Original Message-----
From: Rudi Starcevic [mailto:[EMAIL PROTECTED]]
Sent: Tue 8/10/2004 4:04 PM
To: [EMAIL PROTECTED]
Cc:
Subject: [PERFORM] Bulk Insert and Index use

Hi,

I have a question on bulk checking, inserting into a table, and how best to use an index for performance.

The data I have to work with is a monthly CD-ROM csv data dump of 300,000 property owners from one area/shire.
[snip]
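A sketch of the two approaches Greg describes (the index name is hypothetical, and the per-row statements stand in for whatever the load script actually runs):

-- option 1: drop the index, load, rebuild it afterwards
DROP INDEX property_street_suburb_idx;
-- ... bulk load here (COPY, or the per-row INSERTs) ...
CREATE INDEX property_street_suburb_idx ON property (street, suburb);

-- option 2: keep the index, but batch the whole load in one transaction
BEGIN;
INSERT INTO property (street, suburb) VALUES ('1 First St', 'Sampleville');
INSERT INTO property (street, suburb) VALUES ('2 First St', 'Sampleville');
-- ... remaining rows ...
COMMIT;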
Re: [PERFORM] Bulk Insert and Index use
Hi Jim,

Thanks for your time.

> If the bulk load has the possibility of duplicating data

Yes, each row will require either:

a) One SELECT + One INSERT
or
b) One SELECT + One UPDATE

I did think of using more than one table, i.e. a temp table.

As each month's worth of data is added I expect to see a change from lots of INSERTs to lots of UPDATEs. Perhaps when the UPDATEs become more dominant it would be best to start using indexes. While INSERTs are more prevalent perhaps a seq. scan is better.

I guess of all the options available it boils down to which is quicker for my data: index or sequential scan.

Many thanks.

Jim J wrote:
> If the bulk load has the possibility of duplicating data, then you need to change methods. Try bulk loading into a temp table, index it like the original, eliminate the dups and merge the tables.
> [snip]

--
Regards,
Rudi.

Internet Media Productions
Re: [PERFORM] Bulk Insert and Index use
In an attempt to throw the authorities off his trail, [EMAIL PROTECTED] (Rudi Starcevic) transmitted:
> A minute for your thoughts and/or suggestions would be great.

Could you give a more concrete example? E.g. - the DDL for the table(s), most particularly.

At first guess, I think you're worrying about a nonissue. Each insert will lead to a _modification_ of the various indices, which costs _something_, but which is WAY less expensive than creating each index from scratch.

But perhaps I'm misreading things; DDL for the intended tables and indices would be real handy.

--
output = (cbbrowne @ cbbrowne.com)
http://www.ntlug.org/~cbbrowne/linux.html
Rules of the Evil Overlord #21. "I will hire a talented fashion designer to create original uniforms for my Legions of Terror, as opposed to some cheap knock-offs that make them look like Nazi stormtroopers, Roman footsoldiers, or savage Mongol hordes. All were eventually defeated and I want my troops to have a more positive mind-set." <http://www.eviloverlord.com/>
Re: [PERFORM] Bulk Insert and Index use
If it has to read a majority (or even a good percentage) of the rows in question, a sequential scan is probably faster ... and as Jim pointed out, a temp table can often be a useful medium for getting speed in a load and then allowing you to clean/alter data for a final (easy) push.

G

-----Original Message-----
From: Rudi Starcevic [mailto:[EMAIL PROTECTED]]
Sent: Tue 8/10/2004 8:33 PM
To: [EMAIL PROTECTED]
Cc:
Subject: Re: [PERFORM] Bulk Insert and Index use

Hi Jim,

Thanks for your time.

> If the bulk load has the possibility of duplicating data

Yes, each row will require either:
a) One SELECT + One INSERT
or
b) One SELECT + One UPDATE

[snip]
Re: [PERFORM] Bulk Insert and Index use
Hi,

> In an attempt to throw the authorities off his trail, [EMAIL PROTECTED] (Rudi Starcevic) transmitted:
>> A minute for your thoughts and/or suggestions would be great.

Heh heh.

> Could you give a more concrete example? E.g. - the DDL for the table(s), most particularly.

Thanks. I didn't add the DDL as I thought it might make my question too long. I have the DDL at another office, so I'll pick up this email thread when I get there in a couple of hours.

> At first guess, I think you're worrying about a nonissue. Each insert will lead to a _modification_ of the various indices, which costs _something_, but which is WAY less expensive than creating each index from scratch.

Very interesting - modification versus creation. I will post another email later today.

Many thanks.

--
Regards,
Rudi.

Internet Media Productions
[PERFORM] NUMERIC x VARCHAR
Greetings.

I have a question regarding performance of certain datatypes:

I have a field where I will store my clients' phone numbers. I know that this field will never exceed 15 characters, and I will store only numbers here (no dashes, dots, etc...), so I was wondering:

Which type is faster: NUMERIC(15,0) or VARCHAR(15)? Are there any storage differences between them?

TIA,

--
Er Galvão Abbott
Desenvolvedor Web
http://www.galvao.eti.br/
[EMAIL PROTECTED]
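For reference, the two alternatives being weighed are simply these (a minimal sketch; table and column names are illustrative only):

-- option A: numeric with no decimal places
CREATE TABLE client_a (phone NUMERIC(15,0));
-- option B: variable-length character string capped at 15
CREATE TABLE client_b (phone VARCHAR(15));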