[PERFORM] NOT IN query takes forever
Hi, I have 2 tables like this:

CREATE TABLE query (
    query_id      int not null,
    dat           varchar(64) null,
    sub_acc_id    int null,
    query_ip      varchar(64) null,
    osd_user_type varchar(64) null
);

CREATE TABLE trans (
    transaction_id varchar(64) not null,
    date           varchar(64) null,
    query_id       int not null,
    sub_acc_id     int null,
    reg_acc_id     int null
);

CREATE UNIQUE INDEX query_query_id_idx ON query (query_id);
CREATE INDEX trans_reg_acc_id_idx ON trans (reg_acc_id);
CREATE INDEX trans_query_id_idx ON trans (query_id);

osd= select count(*) from trans
osd- ;
 count
--------
 598809
(1 row)

osd= select count(*) from query
osd- ;
 count
--------
 137042
(1 row)

I just VACUUM ANALYZE'd the database. Trying to run this query:

EXPLAIN ANALYSE select * FROM trans WHERE query_id NOT IN (select query_id FROM query)

but it will remain like that forever (I cancelled it after 30 min). My postgresql.conf is the default:

# - Memory -
shared_buffers = 1000   # min 16, at least max_connections*2, 8KB each
#sort_mem = 1024        # min 64, size in KB
#vacuum_mem = 8192      # min 1024, size in KB

Should I adjust something? I'm using PostgreSQL 7.4.2, and saw in the release notes that IN/NOT IN queries should now be at least as fast as EXISTS.

Thank you!
--
Marius Andreiana
Galuna - Solutii Linux in Romania
http://www.galuna.ro
[PERFORM] pg_autovacuum parameters
Hello all.

I am managing a large database with lots of transactions in different tables. The largest tables have around 5-6 million tuples and around 5-6 inserts and maybe 2 updates per day, while the smallest tables have only a few tuples and a few updates/inserts per day. In addition we have small tables with many updates/inserts. So what I am saying is that there are all kinds of tables and uses of tables in our database. This, I think, makes it difficult to set up pg_autovacuum. I am now running vacuum jobs on different tables in cron.

What things should I consider when setting the base and threshold values in pg_autovacuum? Since the running of vacuum and analyze is relative to the table size, as it must be, I think it is difficult to cover all tables.

Is there anyone who has some thoughts around this?

Regards
Rune
Re: [PERFORM] pg_autovacuum parameters
Lending, Rune wrote:
> Hello all. I am managing a large database with lots of transactions in different tables. The largest tables have around 5-6 million tuples and around 5-6 inserts and maybe 2 updates per day, while the smallest tables have only a few tuples and a few updates/inserts per day. In addition we have small tables with many updates/inserts. So what I am saying is that there are all kinds of tables and uses of tables in our database. This, I think, makes it difficult to set up pg_autovacuum. I am now running vacuum jobs on different tables in cron. What things should I consider when setting the base and threshold values in pg_autovacuum? Since the running of vacuum and analyze is relative to the table size, as it must be, I think it is difficult to cover all tables.

One of the biggest problems with the version of pg_autovacuum in 7.4 contrib is that you can only specify one set of thresholds, which often isn't flexible enough. That said, the thresholds are scaled to table size, since you specify both a base value and a scaling factor: pg_autovacuum -v 100 -V 1 will vacuum a table with 100 rows every 200 updates, but will vacuum a table with 1,000,000 rows only every 1,000,100 updates.

> Is there anyone who has some thoughts around this?

Basically, you should be able to use pg_autovacuum to do most of the vacuuming; if there are a few tables that aren't getting vacuumed often enough, then you can add a vacuum command to cron for those specific tables.

Matthew
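For reference, here is the arithmetic behind those two figures, a sketch based on how the 7.4 contrib module documents its thresholds (vacuum threshold = base value + scaling factor * number of tuples):

    with -v 100 -V 1:
        100-row table:         100 + 1 * 100        = 200 updates between vacuums
        1,000,000-row table:   100 + 1 * 1,000,000  = 1,000,100 updates between vacuums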
Re: [PERFORM] NOT IN query takes forever
Merlin Moncure [EMAIL PROTECTED] writes:
> Try bumping up shared buffers some and sort mem as much as you safely can.

sort_mem is probably the issue here. The only reasonable way to do NOT IN is with a hash table, and the default setting of sort_mem is probably too small to support a 137042-element table.

			regards, tom lane
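As an illustration of Tom's point, a minimal sketch of retrying the query with a larger per-session sort_mem (SET works per session in 7.4; the 32 MB figure is only a guess at what comfortably holds a hash table of ~137,000 values):

    SET sort_mem = 32768;   -- in KB, so 32 MB; affects only this session
    EXPLAIN ANALYZE
    SELECT * FROM trans WHERE query_id NOT IN (SELECT query_id FROM query);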
Re: [PERFORM] NOT IN query takes forever
On Tue, 2004-08-03 at 08:05 -0400, Merlin Moncure wrote:
> > Trying to run this query:
> > EXPLAIN ANALYSE select * FROM trans WHERE query_id NOT IN (select query_id FROM query)
> > but it will remain like that forever (cancelled after 30 min).
>
> explain analyze actually runs the query to do timings. Just run explain and see what you come up with. More than likely there is a nestloop in there which is causing the long query time. Try bumping up shared buffers some and sort mem as much as you safely can.

Thank you, that did it! With

shared_buffers = 3000   # min 16, at least max_connections*2, 8KB each
sort_mem = 128000       # min 64, size in KB

it takes 3 seconds (my hardware is not server-class).
--
Marius Andreiana
Galuna - Solutii Linux in Romania
http://www.galuna.ro
Re: [PERFORM] NOT IN query takes forever
> > Try bumping up shared buffers some and sort mem as much as you safely can.
>
> Thank you, that did it! With
> shared_buffers = 3000   # min 16, at least max_connections*2, 8KB each
> sort_mem = 128000       # min 64, size in KB
> it takes 3 seconds (my hardware is not server-class).

Be careful...sort_mem applies to each connection, and (IIRC) in some cases more than once per connection. Of all the configuration parameters, sort_mem is (IMO) the most important and the hardest to get right. 128000 (i.e. 128 MB) is awfully high unless you have a ton of memory (you don't) or you are running in single-connection scenarios. Do some experimentation by lowering the value until you get a good balance between potential memory consumption and speed.

Merlin
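To put rough numbers on that warning, a worst-case sketch (the connection count here is illustrative, not from the original post):

    potential sort memory = concurrent connections * sorts per query * sort_mem
    e.g. 100 connections * 1 sort * 128 MB = 12.8 GB

which is far more RAM than most machines of the era had, and queries that run several sorts multiply it further.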
Re: [PERFORM] pg_autovacuum parameters
Matthew T. O'Connor wrote:
> Lending, Rune wrote:
> > Hello all. I am managing a large database with lots of transactions in different tables. The largest tables have around 5-6 million tuples and around 5-6 inserts and maybe 2 updates per day, while the smallest tables have only a few tuples and a few updates/inserts per day. In addition we have small tables with many updates/inserts. So what I am saying is that there are all kinds of tables and uses of tables in our database. This, I think, makes it difficult to set up pg_autovacuum. I am now running vacuum jobs on different tables in cron. What things should I consider when setting the base and threshold values in pg_autovacuum? Since the running of vacuum and analyze is relative to the table size, as it must be, I think it is difficult to cover all tables.
>
> One of the biggest problems with the version of pg_autovacuum in 7.4 contrib is that you can only specify one set of thresholds, which often isn't flexible enough. That said, the thresholds are scaled to table size, since you specify both a base value and a scaling factor: pg_autovacuum -v 100 -V 1 will vacuum a table with 100 rows every 200 updates, but will vacuum a table with 1,000,000 rows only every 1,000,100 updates.
>
> > Is there anyone who has some thoughts around this?
>
> Basically, you should be able to use pg_autovacuum to do most of the vacuuming; if there are a few tables that aren't getting vacuumed often enough, then you can add a vacuum command to cron for those specific tables.

And in version 7.5^H^H^H8.0 (Tom Lane docet :-) ) I think it will be possible to specify those thresholds per table...

Regards
Gaetano Mendola
[PERFORM] Performance Bottleneck
I run a Perl/CGI driven website that makes extensive use of PostgreSQL (7.4.3) for everything from user information to formatting and display of specific sections of the site. The server itself is a dual processor AMD Opteron 1.4GHz w/ 2GB RAM and 2 x 120GB hard drives mirrored for redundancy, running under FreeBSD 5.2.1 (AMD64).

Recently loads on the site have increased during peak hours to the point of showing considerable loss in performance. This can be observed when connections move from the 120 concurrent connections to PostgreSQL to roughly 175 or more. Essentially, the machine seems to struggle to keep up with continual requests and slows down respectively as resources are tied down.

Code changes have been made to the scripts to essentially back off in high-load working environments, which has worked to an extent. However, as loads continue to increase, the database itself is not taking well to the increased traffic taking place.

Having taken a look at 'Tuning PostgreSQL for Performance' (http://www.varlena.com/GeneralBits/Tidbits/perf.html), I used it as best I could to choose my settings. However, even with statistics disabled and every setting tweaked, things still continue to deteriorate.

Is there anything anyone can recommend in order to give the system a necessary speed boost? It would seem to me that a modest dataset of roughly a gig combined with that type of hardware should be able to handle substantially more load than it is. Can anyone provide me with clues as to where to pursue? Would disabling 'fsync' provide more performance if I accept that information may be lost in case of a crash?

If anyone needs access to logs, settings, et cetera, please ask; I simply wish to test the waters first on what is needed. Thanks!

	Martin Foster
	[EMAIL PROTECTED]
Re: [PERFORM] NOT IN query takes forever
On Tue, 2004-08-03 at 10:10, Merlin Moncure wrote:
> > > Try bumping up shared buffers some and sort mem as much as you safely can.
> >
> > Thank you, that did it! With
> > shared_buffers = 3000   # min 16, at least max_connections*2, 8KB each
> > sort_mem = 128000       # min 64, size in KB
> > it takes 3 seconds (my hardware is not server-class).
>
> Be careful...sort_mem applies to each connection, and (IIRC) in some cases more than once per connection. Of all the configuration parameters, sort_mem is (IMO) the most important and the hardest to get right. 128000 (i.e. 128 MB) is awfully high unless you have a ton of memory (you don't) or you are running in single-connection scenarios. Do some experimentation by lowering the value until you get a good balance between potential memory consumption and speed.

Minor nit: sort_mem actually applies to EACH sort individually, so a query that had to run three sorts could use 3 x sort_mem. Note that one can set sort_mem per backend connection with "set sort_mem=128000" if need be, so as not to use up all the memory with other backends.
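A minimal sketch of that per-backend approach, keeping the global default modest and raising sort_mem only around the expensive query:

    SET sort_mem = 128000;   -- KB; applies to each sort in this session only
    SELECT * FROM trans WHERE query_id NOT IN (SELECT query_id FROM query);
    RESET sort_mem;          -- return to the postgresql.conf default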
Re: [PERFORM] Performance Bottleneck
On Tue, 3 Aug 2004, Martin Foster wrote:
> to roughly 175 or more. Essentially, the machine seems to struggle to keep up with continual requests and slows down respectively as resources are tied down.

I suggest you try to find queries that are slow and check to see if the plans are optimal for those queries. There are some logging options for logging queries that run longer than a user-set limit. That can help in finding the slow queries. Just doing some logging for some typical page fetches often shows things that can be done better. For example, it's not uncommon to see the same information being pulled several times by mistake.

Maybe you can also try something like connection pooling. I'm not sure how much that can give, but for small queries the connection time is usually the big part.

> Would disabling 'fsync' provide more performance if I accept that information may be lost in case of a crash?

I would not do that. In most cases the performance increase is modest and the data corruption risk after a crash is much bigger, so it's not worth it. If you have a lot of small inserts then it might be faster with this, but if possible it's much better to try to do more work in one transaction than before.

--
/Dennis Björklund
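As a concrete example of the logging options Dennis mentions, 7.4 added log_min_duration_statement; a sketch of a postgresql.conf entry (the 500 ms threshold is an arbitrary starting point, to be tuned per site):

    log_min_duration_statement = 500   # log any statement running longer than 500 ms
                                       # 0 logs all statements; -1 disables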
Re: [PERFORM] Performance Bottleneck
Hello,

It sounds to me like you are IO bound. 2 x 120GB hard drives just isn't going to cut it with that many connections (as a general rule). Are you swapping?

Sincerely,

Joshua D. Drake

Martin Foster wrote:
> I run a Perl/CGI driven website that makes extensive use of PostgreSQL (7.4.3) for everything from user information to formatting and display of specific sections of the site. The server itself is a dual processor AMD Opteron 1.4GHz w/ 2GB RAM and 2 x 120GB hard drives mirrored for redundancy, running under FreeBSD 5.2.1 (AMD64).

--
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
PostgreSQL support, programming, shared hosting and dedicated hosting.
+1-503-667-4564 - [EMAIL PROTECTED] - http://www.commandprompt.com
Mammoth PostgreSQL Replicator. Integrated Replication for PostgreSQL
Re: [PERFORM] NOT IN query takes forever
Marius Andreiana wrote:
> On Tue, 2004-08-03 at 08:05 -0400, Merlin Moncure wrote:
> > > Trying to run this query:
> > > EXPLAIN ANALYSE select * FROM trans WHERE query_id NOT IN (select query_id FROM query)
> > > but it will remain like that forever (cancelled after 30 min).
> >
> > explain analyze actually runs the query to do timings. Just run explain and see what you come up with. More than likely there is a nestloop in there which is causing the long query time. Try bumping up shared buffers some and sort mem as much as you safely can.
>
> Thank you, that did it! With
> shared_buffers = 3000   # min 16, at least max_connections*2, 8KB each
> sort_mem = 128000       # min 64, size in KB

128 MB for sort_mem is too much; consider that this way each backend can use 128 MB for sort operations. Also, shared_buffers = 3000 means 24 MB, which is not balanced against the 128 MB granted to sorts. Try bumping shared_buffers up to 128 MB (you may need to instruct your OS to allow that amount of shared memory usage) and bringing sort_mem down to 24 MB.

Regards
Gaetano Mendola
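In 7.4 units, that suggestion works out as below; a sketch only, since shared_buffers is counted in 8 KB pages and sort_mem in KB, and the kernel's shared memory limit (e.g. SHMMAX) may need raising to match:

    shared_buffers = 16384   # 16384 * 8 KB = 128 MB
    sort_mem = 24576         # 24576 KB = 24 MB per sort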
Re: [PERFORM] Performance Bottleneck
Martin Foster wrote:
> I run a Perl/CGI driven website that makes extensive use of PostgreSQL (7.4.3) for everything from user information to formatting and display of specific sections of the site. The server itself is a dual processor AMD Opteron 1.4GHz w/ 2GB RAM and 2 x 120GB hard drives mirrored for redundancy, running under FreeBSD 5.2.1 (AMD64).
>
> Recently loads on the site have increased during peak hours to the point of showing considerable loss in performance. This can be observed when connections move from the 120 concurrent connections to PostgreSQL to roughly 175 or more. Essentially, the machine seems to struggle to keep up with continual requests and slows down respectively as resources are tied down.
>
> Code changes have been made to the scripts to essentially back off in high-load working environments, which has worked to an extent. However, as loads continue to increase, the database itself is not taking well to the increased traffic taking place.
>
> Having taken a look at 'Tuning PostgreSQL for Performance' (http://www.varlena.com/GeneralBits/Tidbits/perf.html), I used it as best I could to choose my settings. However, even with statistics disabled and every setting tweaked, things still continue to deteriorate.
>
> Is there anything anyone can recommend in order to give the system a necessary speed boost? It would seem to me that a modest dataset of roughly a gig combined with that type of hardware should be able to handle substantially more load than it is. Can anyone provide me with clues as to where to pursue? Would disabling 'fsync' provide more performance if I accept that information may be lost in case of a crash?
>
> If anyone needs access to logs, settings, et cetera, please ask; I simply wish to test the waters first on what is needed. Thanks!

Tell us about your typical queries and show us your configuration file. Are the accesses read-only, or do you have concurrent writers and readers? During peak hours, are your processors tied up at 100%? What do vmstat and iostat say? Maybe you are not using indexes somewhere, or maybe you are but the planner is not using them... In two words: we need more information in order to help you.

Regards
Gaetano Mendola
Re: [PERFORM] NOT IN query takes forever
> explain analyze actually runs the query to do timings. Just run explain and see what you come up with. More than likely there is a nestloop in there which is causing the long query time. Try bumping up shared buffers some and sort mem as much as you safely can.

Just use an EXISTS query, I suggest.

Chris
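A sketch of that rewrite: with a correlated NOT EXISTS, 7.4 can probe the unique index on query.query_id row by row instead of hashing the whole subquery result in sort_mem:

    SELECT t.*
    FROM trans t
    WHERE NOT EXISTS (SELECT 1 FROM query q WHERE q.query_id = t.query_id);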
Re: [PERFORM] Performance Bottleneck
Gaetano Mendola wrote:
> Martin Foster wrote:
> > I run a Perl/CGI driven website that makes extensive use of PostgreSQL (7.4.3) for everything from user information to formatting and display of specific sections of the site. The server itself is a dual processor AMD Opteron 1.4GHz w/ 2GB RAM and 2 x 120GB hard drives mirrored for redundancy, running under FreeBSD 5.2.1 (AMD64).
> >
> > Recently loads on the site have increased during peak hours to the point of showing considerable loss in performance. This can be observed when connections move from the 120 concurrent connections to PostgreSQL to roughly 175 or more. Essentially, the machine seems to struggle to keep up with continual requests and slows down respectively as resources are tied down.
> >
> > Code changes have been made to the scripts to essentially back off in high-load working environments, which has worked to an extent. However, as loads continue to increase, the database itself is not taking well to the increased traffic taking place.
> >
> > Having taken a look at 'Tuning PostgreSQL for Performance' (http://www.varlena.com/GeneralBits/Tidbits/perf.html), I used it as best I could to choose my settings. However, even with statistics disabled and every setting tweaked, things still continue to deteriorate.
> >
> > Is there anything anyone can recommend in order to give the system a necessary speed boost? It would seem to me that a modest dataset of roughly a gig combined with that type of hardware should be able to handle substantially more load than it is. Can anyone provide me with clues as to where to pursue? Would disabling 'fsync' provide more performance if I accept that information may be lost in case of a crash?
> >
> > If anyone needs access to logs, settings, et cetera, please ask; I simply wish to test the waters first on what is needed. Thanks!
>
> Tell us about your typical queries and show us your configuration file. Are the accesses read-only, or do you have concurrent writers and readers? During peak hours, are your processors tied up at 100%? What do vmstat and iostat say? Maybe you are not using indexes somewhere, or maybe you are but the planner is not using them... In two words: we need more information in order to help you.

I included all the files as attachments, which will hopefully cut down on follow-up emails. As for things like connection pooling, the web server makes use of Apache::DBI to pool the connections for the Perl scripts being driven on that server. For the sake of being thorough, a quick 'apachectl status' was thrown in while the database was under a good load.

Since it would rather slow things down to wait for the servers to really get bogged down with load averages of 20.00 and more, I opted to choose a period of time when we are a bit busier than normal. You will be able to see how the system behaves under a light load and subsequently reaching 125 or so concurrent connections.

The queries themselves are simple, normally drawing information from one table with few conditions, or in the most complex cases using joins on two tables or subqueries. These behave very well and always have; the problem is that these queries take place in rather large numbers due to the dumb nature of the scripts themselves. Over a year ago, when I was still using MySQL for the project, the statistics generated would report well over 65 queries per second under loads ranging from 130 to 160 at peak, averaged over the weeks of operation.
Looking at the Apache status, one can see that it averages only roughly 2.5 requests per second, which gives a slight indication as to what is taking place. A quick run of 'systat -ifstat' shows the following:

    Interface Traffic       Peak           Total
    lo0   in  0.000 KB/s    0.000 KB/s     37.690 GB
          out 0.000 KB/s    0.000 KB/s     37.690 GB
    em0   in  34.638 KB/s   41.986 KB/s    28.998 GB
          out 70.777 KB/s   70.777 KB/s    39.553 GB

em0 is a full-duplex 100 Mbps connection to an internal switch that supports the servers directly. Load on the loopback was cut down considerably once I stopped using pg_autovacuum, since its performance benefits under low load were buried under the hindrance it caused when traffic was high.

I am sure that there are some places that could benefit from some optimization, especially in the case of indexes. However, as a whole the problem seems to be related more to the massive onslaught of queries than to anything else. Also note that some of these scripts run for longer durations even if they are web based. Some run as long as 30 minutes, making queries to the database with wait periods of five seconds to twenty-five seconds between them. Under high duress the