Re: [PERFORM] pg_connect takes 3.0 seconds
On Wed, 6 Jan 2010, Dmitri Girski wrote:

> On the other hand, if I use ip addresses this should not attract any
> possible issues with DNS, right?

Not true. It is likely that the server program you are connecting to will perform a reverse DNS lookup to work out who the client is, for logging or authentication purposes.

Matthew

--
To err is human; to really louse things up requires root privileges.
   -- Alexander Pope, slightly paraphrased
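If reverse DNS is suspected, it is easy to rule in or out: time a reverse lookup for the web server's address from the database host, and check whether the server is configured to resolve client addresses at all. A rough sketch, assuming a Linux box with dig installed (the address and config path below are placeholders):

  # Time a reverse lookup for the connecting client's address
  time dig -x 192.168.1.50 +tries=1 +time=2

  # log_hostname = on makes the server resolve every connecting address for the logs
  grep log_hostname /etc/postgresql/postgresql.conf

If the dig takes seconds rather than milliseconds, the resolver is the place to look.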
Re: [PERFORM] pg_connect takes 3.0 seconds
On Tue, Jan 5, 2010 at 11:50 PM, Craig Ringer cr...@postnewspapers.com.au wrote:

> Wireshark is your friend.

+1. I think if you put a packet sniffer on the interface you are connecting from it will become clear what the problem is in short order.

...Robert
Re: [PERFORM] pg_connect takes 3.0 seconds
The fact that the delays are clustered at (3 + 0.2n) seconds, rather than spread across a distributed range, strongly indicates a timeout and not (directly) a resource issue. 3 seconds is too fast for a timeout on almost any DNS operation, unless it has been modified, so I'd suspect it's the TCP layer, e.g. perhaps the SYN packet goes AWOL and it has to retry.

I'd second the vote for investigation with a packet sniffing tool (Wireshark, tcpdump, etc).

Cheers
Dave

On Mon, Jan 4, 2010 at 8:12 PM, Dmitri Girski mite...@gmail.com wrote:

> Hi everybody,
>
> I am running a PostgreSQL server 8.3.5 with a pretty much standard
> config. The web application server which runs Apache 1.3/PHP2.9 has an
> intermittent problem: pg_connect takes exactly 3.0 seconds. The usual
> connection time is 0.0045 seconds. The long request happens at an
> approximate rate of 1:100.
>
> I turned on logs on the postgres server side, and there is nothing
> suspicious for me there. When a connection request comes, it is served
> without any delay.
>
> Could anyone point me to the direction in which I should investigate this
> problem further?
> Thank you in advance!
>
> PS The hardware is: Dell SC1435/4Gb/2x2.0GHz/Gentoo Linux. The database
> and web servers are in 2 local subnets.
>
> Dmitri.
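If it is a lost SYN, a capture restricted to connection setup will show the retransmits directly. Something along these lines should work with tcpdump (the interface name and database address are placeholders):

  # Run on the web server; capture only SYN/RST segments to the database port
  tcpdump -i eth0 -n -s 0 -w pgconnect.pcap \
    'host 192.168.1.10 and port 5432 and tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'

A 3-second gap between two SYNs to the same destination in the resulting capture would confirm the retransmission theory.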
Re: [PERFORM] pg_connect takes 3.0 seconds
Dave Crooke wrote:
> The fact that the delays are clustered at (3 + 0.2n) seconds, rather than
> a distributed range, strongly indicates a timeout and not (directly) a
> resource issue. 3 seconds is too fast for a timeout on almost any DNS
> operation, unless it has been modified, so I'd suspect it's the TCP
> layer, e.g. perhaps the SYN packet goes awol and it has to retry.
>
> I'd second the vote for investigation with a packet sniffing tool
> (Wireshark, tcpdump, etc)

If you have a PC (Windows), pingplotter is a remarkable and simple tool to use that quickly identifies problems, and gives results that are convincing when you show them to your network admin. Wireshark and tcpdump have a pretty steep learning curve and are overkill if your problem is simple.

Craig
[PERFORM] Digesting explain analyze
Hi.

I have a table that consists of somewhere in the magnitude of 100.000.000 rows, where each row is a tuple of the form (id1, id2, evalue). I'd like to speed up a query like this:

explain analyze select id from table where id1 = 2067 or id2 = 2067 order by evalue asc limit 100;

                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1423.28..1423.28 rows=100 width=12) (actual time=2.565..2.567 rows=100 loops=1)
   ->  Sort  (cost=1423.28..1424.54 rows=505 width=12) (actual time=2.560..2.560 rows=100 loops=1)
         Sort Key: evalue
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  Bitmap Heap Scan on table  (cost=16.58..1420.75 rows=505 width=12) (actual time=0.709..1.752 rows=450 loops=1)
               Recheck Cond: ((id1 = 2067) OR (id2 = 2067))
               ->  BitmapOr  (cost=16.58..16.58 rows=506 width=0) (actual time=0.676..0.676 rows=0 loops=1)
                     ->  Bitmap Index Scan on id1_evalue_idx  (cost=0.00..11.44 rows=423 width=0) (actual time=0.599..0.599 rows=450 loops=1)
                           Index Cond: (id1_id = 2067)
                     ->  Bitmap Index Scan on id2_evalue_idx  (cost=0.00..4.89 rows=83 width=0) (actual time=0.070..0.070 rows=1 loops=1)
                           Index Cond: (id2_id = 2067)
 Total runtime: 2.642 ms
(12 rows)

What I had expected was to see the Bitmap Index Scan on id1_evalue_idx stop early because of the LIMIT. The inner sets are on average 3.000 rows for both id1 and id2, and a typical limit would be 100, so if I could convince postgresql to not fetch all of them then I would reduce the retrieved set by a factor of around 60.

The dataset is quite large, so a random query is not very likely to hit the same part of the dataset again, so there is going to be a fair amount of going to disk. It would also mean that using it in a for loop in a stored procedure in plpgsql would not get any benefit from the CURSOR effect?

I actually tried to stuff id1,id2 into an array and build a GiST index on (array, evalue), hoping that it would directly satisfy this query... it used the GiST index to fetch the rows, then did the sorting and limit on the set afterwards.

What it boils down to is more or less:

Does a bitmap index scan support ordering and limit?
Does a multicolumn GiST index support ordering and limit?
Have I missed anything that can hugely speed up fetching these (typically 100) rows from the database?

--
Jesper
Re: [PERFORM] Digesting explain analyze
Jesper Krogh wrote:
> I have a table that consists of somewhere in the magnitude of 100.000.000
> rows, where each row is a tuple of the form (id1, id2, evalue). I'd like
> to speed up a query like this:
>
> explain analyze select id from table where id1 = 2067 or id2 = 2067 order
> by evalue asc limit 100;
> ...
> The inner sets are on average 3.000 rows for both id1 and id2, and a
> typical limit would be 100, so if I could convince postgresql to not
> fetch all of them then I would reduce the retrieved set by a factor of
> around 60. The dataset is quite large, so a random query is not very
> likely to hit the same part of the dataset again, so there is going to be
> a fair amount of going to disk.

If disk seeks are killing you a kinda crazy idea would be to duplicate the table - clustering one by (id1) and the other one by an index on (id2) and unioning the results of each. Since each of these duplicates of the table will be clustered by the column you're querying it on, it should just take one seek in each table.

Then your query could be something like

  select * from (
    select * from t1 where id1=2067 order by evalue limit 100
    union
    select * from t2 where id2=2067 order by evalue limit 100
  ) as foo order by evalue limit 100;

Hmm.. and I wonder whether, by putting evalue into the clustering criteria too (i.e. cluster on (id1,evalue)), you could make it so the limit finds the right 100 evalues first for each table.
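A rough sketch of that layout, folding in the (id1, evalue) ordering idea from the last paragraph (all names here are made up):

  CREATE TABLE t1 AS SELECT * FROM the_table;
  CREATE INDEX t1_id1_evalue_idx ON t1 (id1, evalue);
  CLUSTER t1 USING t1_id1_evalue_idx;  -- pre-8.4 syntax: CLUSTER t1_id1_evalue_idx ON t1;

  CREATE TABLE t2 AS SELECT * FROM the_table;
  CREATE INDEX t2_id2_evalue_idx ON t2 (id2, evalue);
  CLUSTER t2 USING t2_id2_evalue_idx;  -- pre-8.4 syntax: CLUSTER t2_id2_evalue_idx ON t2;

  ANALYZE t1;
  ANALYZE t2;

The obvious costs are double the disk space and the need to re-CLUSTER periodically as the data grows.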
[PERFORM] noob inheritance question
Hello,

I am a complete noob to Postgres and to this list, and I hope this will be the appropriate list for this question.

I'm hoping the inheritance feature will be a nice alternative method for me to implement categories in a particular database of products I need to keep updated. I suppose in MySQL I would probably do this by creating, for example, one table for the products, and then a table(s) for categories, and then I'd be faced with a choice between using an adjacency list or nested set paradigm for, say, breadcrumb links in my private web app.

On the other hand, in Postgres what I'd like to do is just create an empty root product table, then create, for example, a spirits table that inherits from products, and a rums table that inherits from spirits, and then aged rum, flavored rum, et al, which inherit from rums.

In this scenario, my idea was to have all my fields in products and to not add any additional fields in the child tables. Also, only the lowest level of child tables in any given branch of products would actually contain data / rows.

Assuming this is a good design, what I'm wondering is how inheritance is actually implemented deep down inside Postgres, and whether it's anything at all like JOINs, say, in the case of merely doing:

SELECT * FROM flavored_rum   (the lowest level in a given branch)
or
SELECT * FROM spirits   (the root level, or some intermediate level in a given branch)

I'm wondering if there's any performance penalty here, analogous to the penalty of JOINs in a regular RDBMS (versus an ORDBMS).

If anyone can offer any insight as to how inheritance is actually executed (compared to JOINs especially), I'd be most grateful.

Thank you,
DG
Re: [PERFORM] noob inheritance question
On Wed, Jan 6, 2010 at 3:53 PM, Zintrigue zintri...@gmail.com wrote:

> I'm wondering if there's any performance penalty here, analogous to the
> penalty of JOINs in a regular RDBMS (versus an ORDBMS).
> If anyone can offer any insight as to how inheritance is actually
> executed (compared to JOINs especially), I'd be most grateful.

PostgreSQL inheritance is just a sugar-coated form of horizontal table partitioning. So it suffers from all of the problems associated with selection on UNION ALL queries.

--
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug
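For what it's worth, a minimal sketch of the layout being discussed and how it gets planned (table names are illustrative only):

  CREATE TABLE products (id serial PRIMARY KEY, name text, price numeric);
  CREATE TABLE spirits      () INHERITS (products);
  CREATE TABLE rums         () INHERITS (spirits);
  CREATE TABLE flavored_rum () INHERITS (rums);

  -- Note: the PRIMARY KEY constraint is not inherited by the children.

  -- A select on any non-leaf table is planned as an Append over that table
  -- and all of its descendants, much like a UNION ALL:
  EXPLAIN SELECT * FROM spirits;

So there is no join involved, but every query against a parent has to scan (or at least plan for) each child table underneath it.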
Re: [PERFORM] Digesting explain analyze
On Wed, Jan 6, 2010 at 2:10 PM, Jesper Krogh jes...@krogh.cc wrote:
> Hi.
>
> I have a table that consists of somewhere in the magnitude of 100.000.000
> rows, where each row is a tuple of the form (id1, id2, evalue). I'd like
> to speed up a query like this:
>
> explain analyze select id from table where id1 = 2067 or id2 = 2067 order
> by evalue asc limit 100;
>
>                            QUERY PLAN
> ----------------------------------------------------------------------
>  Limit  (cost=1423.28..1423.28 rows=100 width=12) (actual time=2.565..2.567 rows=100 loops=1)
>    ->  Sort  (cost=1423.28..1424.54 rows=505 width=12) (actual time=2.560..2.560 rows=100 loops=1)
>          Sort Key: evalue
>          Sort Method:  top-N heapsort  Memory: 25kB
>          ->  Bitmap Heap Scan on table  (cost=16.58..1420.75 rows=505 width=12) (actual time=0.709..1.752 rows=450 loops=1)
>                Recheck Cond: ((id1 = 2067) OR (id2 = 2067))
>                ->  BitmapOr  (cost=16.58..16.58 rows=506 width=0) (actual time=0.676..0.676 rows=0 loops=1)
>                      ->  Bitmap Index Scan on id1_evalue_idx  (cost=0.00..11.44 rows=423 width=0) (actual time=0.599..0.599 rows=450 loops=1)
>                            Index Cond: (id1_id = 2067)
>                      ->  Bitmap Index Scan on id2_evalue_idx  (cost=0.00..4.89 rows=83 width=0) (actual time=0.070..0.070 rows=1 loops=1)
>                            Index Cond: (id2_id = 2067)
>  Total runtime: 2.642 ms
> (12 rows)
>
> What I had expected was to see the Bitmap Index Scan on id1_evalue_idx
> stop early because of the LIMIT. The inner sets are on average 3.000 rows
> for both id1 and id2, and a typical limit would be 100, so if I could
> convince postgresql to not fetch all of them then I would reduce the
> retrieved set by a factor of around 60.
>
> The dataset is quite large, so a random query is not very likely to hit
> the same part of the dataset again, so there is going to be a fair amount
> of going to disk. It would also mean that using it in a for loop in a
> stored procedure in plpgsql would not get any benefit from the CURSOR
> effect?
>
> I actually tried to stuff id1,id2 into an array and build a GiST index on
> (array, evalue), hoping that it would directly satisfy this query... it
> used the GiST index to fetch the rows, then did the sorting and limit on
> the set afterwards.
>
> What it boils down to is more or less:
>
> Does a bitmap index scan support ordering and limit?
> Does a multicolumn GiST index support ordering and limit?
> Have I missed anything that can hugely speed up fetching these (typically
> 100) rows from the database?

Bitmap index scans always return all the matching rows. It would be nice if they could fetch them in chunks for queries like this, but they can't. I am not sure whether there's any way to make GiST do what you want.

You might try something like this (untested):

SELECT id FROM (
   select id, evalue from table where id1 = 2067 order by evalue asc limit 100
   union all
   select id, evalue from table where id2 = 2067 order by evalue asc limit 100
) x ORDER BY evalue LIMIT 100

If you have an index by (id1, evalue) and by (id2, evalue) then I would think this would be pretty quick, as it should do two index scans (not bitmap index scans) to fetch 100 rows for each, then append the results, sort them, and then limit again.

...Robert
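For reference, the two-column indexes that plan relies on would be declared something like this (the names are made up; given that the existing indexes are already called id1_evalue_idx and id2_evalue_idx, they may well have this shape already, with id1_id/id2_id as the leading columns):

  CREATE INDEX id1_evalue_2col_idx ON the_table (id1_id, evalue);
  CREATE INDEX id2_evalue_2col_idx ON the_table (id2_id, evalue);

With those in place, each arm of the UNION ALL can be satisfied by walking one index in evalue order and stopping after 100 rows.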
Re: [PERFORM] pg_connect takes 3.0 seconds
Hi everybody,

Many thanks to everyone who replied, I think we are on the right way. I've used tcpdump to generate the logs and there are a lot of dropped packets due to bad checksums. The network guy is currently looking at the problem and most likely this is a hardware issue.

Cheers,
Dmitri.

On Tue, Jan 5, 2010 at 1:12 PM, Dmitri Girski mite...@gmail.com wrote:

> Hi everybody,
>
> I am running a PostgreSQL server 8.3.5 with a pretty much standard
> config. The web application server which runs Apache 1.3/PHP2.9 has an
> intermittent problem: pg_connect takes exactly 3.0 seconds. The usual
> connection time is 0.0045 seconds. The long request happens at an
> approximate rate of 1:100.
>
> I turned on logs on the postgres server side, and there is nothing
> suspicious for me there. When a connection request comes, it is served
> without any delay.
>
> Could anyone point me to the direction in which I should investigate this
> problem further?
> Thank you in advance!
>
> PS The hardware is: Dell SC1435/4Gb/2x2.0GHz/Gentoo Linux. The database
> and web servers are in 2 local subnets.
>
> Dmitri.

--
@Gmail
Re: [PERFORM] pg_connect takes 3.0 seconds
On Wed, Jan 6, 2010 at 7:44 PM, Dmitri Girski mite...@gmail.com wrote:

> Hi everybody,
>
> Many thanks to everyone who replied, I think we are on the right way.
> I've used tcpdump to generate the logs and there are a lot of dropped
> packets due to bad checksums. The network guy is currently looking at
> the problem and most likely this is a hardware issue.

95% of these problems are a bad NIC or a bad cable. Since cables are easy to change I'd try those first, then NICs. Since lots of servers have dual NICs that's a pretty easy change too. Every now and then it's a bad switch / router.
Re: [PERFORM] noob inheritance question
Inheritance would only make sense if each of your categories had more columns. Say if you had a "wines" category and only they had a year column. It's probably not worth it for one or two columns, but if you've got a big crazy heterogeneous tree of stuff then it's probably appropriate.

I'm with Richard in that it sounds like the right way to solve your problem is to have a categories table and a products table. Let the categories table have a reference to the parent, I suppose just like they do in the first section of http://dev.mysql.com/tech-resources/articles/hierarchical-data.html . The other sections on the page just seem like overkill to me.

On Wed, Jan 6, 2010 at 7:13 PM, Richard Neill rn...@cam.ac.uk wrote:
> Zintrigue wrote:
>> I'm hoping the inheritance feature will be a nice alternative method for
>> me to implement categories in a particular database of products I need
>> to keep updated. I suppose in MySQL I would probably do this by
>> creating, for example, one table for the products, and then a table(s)
>> for categories, and then I'd be faced with a choice between using an
>> adjacency list or nested set paradigm for, say, breadcrumb links in my
>> private web app.
>>
>> On the other hand, in Postgres what I'd like to do is just create an
>> empty root product table, then create, for example, a spirits table that
>> inherits from products, and a rums table that inherits from spirits, and
>> then aged rum, flavored rum, et al, which inherit from rums.
>>
>> In this scenario, my idea was to have all my fields in products and to
>> not add any additional fields in the child tables. Also, only the lowest
>> level of child tables in any given branch of products would actually
>> contain data / rows.
>>
>> Assuming this is a good design,
>
> May I venture to stop you there. This sounds like you are doing it The
> Hard Way. In particular, each time you add a new category, you're going
> to have to add a new database table, and your schema is going to get to
> be horrible. Inserts aren't going to be much fun either.
>
> Rather than adding multiple child tables, may I suggest some other way of
> tracking which item is a subset of the other.
>
> You could do it by having 2 columns:
>   id, parent_id   (each integer and indexed)
> or you could do it by having 2 columns:
>   id, list   (id is integer, list is e.g. 1,3,5,13)
> (where the list is a comma-separated list, or an array, and holds the
> full path)
>
> Depending on scale, you may be able to choose a simple algorithm instead
> of hunting for the most efficient one.
>
> Best wishes,
>
> Richard
>
> P.S. This is the performance mailing list - you may find one of the other
> lists better suited to your questions.
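For the adjacency-list approach, a minimal sketch (all names are illustrative; the WITH RECURSIVE breadcrumb query needs PostgreSQL 8.4 or later):

  CREATE TABLE category (
      id        serial PRIMARY KEY,
      parent_id integer REFERENCES category (id),
      name      text NOT NULL
  );
  CREATE INDEX category_parent_idx ON category (parent_id);

  CREATE TABLE product (
      id          serial PRIMARY KEY,
      category_id integer NOT NULL REFERENCES category (id),
      name        text NOT NULL
  );

  -- Breadcrumb trail for one category (42 is a placeholder id):
  WITH RECURSIVE crumbs AS (
      SELECT id, parent_id, name FROM category WHERE id = 42
      UNION ALL
      SELECT c.id, c.parent_id, c.name
      FROM category c
      JOIN crumbs b ON c.id = b.parent_id
  )
  SELECT name FROM crumbs;

Adding a category is then a single INSERT rather than a new table.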
[PERFORM] Joining on text field VS int
Hi,

I am going to test this out, but it would be good to know anyway. A large table is joined to a tiny table (8 rows) on a text field. Should I be joining on an int field, e.g. recid, instead of the name? Is the performance affected in either case?

Thanks.
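For concreteness, the two variants look something like this (all table and column names are made up):

  -- Join on a small integer surrogate key
  SELECT big.*, lk.name
  FROM   big_table big
  JOIN   lookup lk ON lk.recid = big.lookup_recid;

  -- Join on the text name instead
  SELECT big.*, lk.name
  FROM   big_table big
  JOIN   lookup lk ON lk.name = big.lookup_name;

With only 8 rows on the small side the join itself is cheap either way; the practical difference is mostly that the integer key keeps the large table's rows (and any index on that key) narrower, and integer comparisons are cheaper than collation-aware text comparisons.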
Re: [PERFORM] Digesting explain analyze
Ron Mayer wrote:
>> ...The inner sets are on average 3.000 rows for both id1 and id2, and a
>> typical limit would be 100, so if I could convince postgresql to not
>> fetch all of them then I would reduce the retrieved set by a factor of
>> around 60. The dataset is quite large, so a random query is not very
>> likely to hit the same part of the dataset again, so there is going to
>> be a fair amount of going to disk.
>
> If disk seeks are killing you a kinda crazy idea would be to duplicate
> the table - clustering one by (id1) and the other one by an index on
> (id2) and unioning the results of each.

That's doubling the disk space needs for the table. What are the odds that this would be a benefit when the initial table by itself already significantly exceeds available memory?

> Since each of these duplicates of the table will be clustered by the
> column you're querying it on, it should just take one seek in each table.
>
> Then your query could be something like
>
>   select * from (
>     select * from t1 where id1=2067 order by evalue limit 100
>     union
>     select * from t2 where id2=2067 order by evalue limit 100
>   ) as foo order by evalue limit 100;

This is actually what I ended up with as the best performing query, just still on a single table, because without duplication I can add indexes and optimize this one by (id1,evalue) and (id2,evalue). It is still getting killed quite a lot by disk IO. So I guess I'm up to:

1) Buy better disks (I need to get an estimate of how large it actually is going to get).
2) Stick with one table, but make sure to have enough activity to get a large part of the index into the OS cache anyway (and add more memory if necessary).

The data is seeing a fair amount of growth (doubles in a couple of years), so it is fairly hard to maintain clustering on them, I would suspect.

Is it possible to get PG to tell me how many rows fit in a disk page? All columns are sitting in "plain" storage according to \d+ on the table.

> Hmm.. and I wonder if putting evalue into the criteria to cluster the
> tables too (i.e. cluster on id1,evalue) if you could make it so the limit
> finds the right 100 evalues first for each table

I didn't cluster it, since clustering locks everything.

--
Jesper
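On the rows-per-page question, the planner's statistics give a quick average straight from pg_class (substitute the real table name):

  SELECT relname, relpages, reltuples,
         reltuples / NULLIF(relpages, 0) AS avg_rows_per_page
  FROM   pg_class
  WHERE  relname = 'the_table';

The figures are only as fresh as the last ANALYZE/VACUUM, but they are close enough for sizing estimates.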
Re: [PERFORM] pg_connect takes 3.0 seconds
On 7/01/2010 10:44 AM, Dmitri Girski wrote:
> Hi everybody,
>
> Many thanks to everyone who replied, I think we are on the right way.
> I've used tcpdump to generate the logs and there are a lot of dropped
> packets due to bad checksums. The network guy is currently looking at
> the problem and most likely this is a hardware issue.

Hang on a sec. You need to ignore bad checksums on *outbound* packets, because many (most?) Ethernet drivers implement some level of TCP offloading, and this will result in packet sniffers seeing invalid checksums for transmitted packets - the checksums haven't been generated by the NIC yet.

Unless you know for sure that your NIC doesn't do TSO, ignore bad checksums on outbound packets from the local interface.

--
Craig Ringer
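If in doubt, the offload settings can be checked on the box itself; something like this (the interface name is a placeholder) shows whether checksumming and segmentation are being pushed down to the NIC:

  ethtool -k eth0 | grep -E 'checksum|segmentation'

If tx checksumming or TCP segmentation offload is on, bad checksums on locally transmitted packets in a capture are expected and harmless.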
[PERFORM] Massive table (500M rows) update nightmare
Our DB has an audit table which is 500M rows and growing. (FYI the objects being audited are grouped semantically, not individual field values).

Recently we wanted to add a new feature and we altered the table to add a new column. We are backfilling this varchar(255) column by writing a TCL script to page through the rows (where every update is a "UPDATE ... WHERE id >= x AND id < x+10" and a commit is performed after every 1000 update statements, i.e. every 10,000 rows.)

We have 10 columns, six of which are indexed. Rough calculations suggest that this will take two to three weeks to complete on an 8-core CPU with more than enough memory.

As a ballpark estimate - is this sort of performance what one would expect of PG for 500M updates given the table structure (detailed below), or should I dig deeper to look for performance issues?

As always, thanks!

Carlo

Table/index structure:

CREATE TABLE mdx_core.audit_impt
(
  audit_impt_id serial NOT NULL,
  impt_session integer,
  impt_version character varying(255),
  impt_name character varying(255),
  impt_id integer,
  target_table character varying(255),
  target_id integer,
  target_op character varying(10),
  note text,
  source_table character varying(255),
  CONSTRAINT audit_impt_pkey PRIMARY KEY (audit_impt_id)
)

CREATE INDEX audit_impt_impt_id_idx
  ON mdx_core.audit_impt USING btree (impt_id);
CREATE INDEX audit_impt_impt_name
  ON mdx_core.audit_impt USING btree (impt_name, impt_version);
CREATE INDEX audit_impt_session_idx
  ON mdx_core.audit_impt USING btree (impt_session);
CREATE INDEX audit_impt_source_table
  ON mdx_core.audit_impt USING btree (source_table);
CREATE INDEX audit_impt_target_id_idx
  ON mdx_core.audit_impt USING btree (target_id, audit_impt_id);
CREATE INDEX audit_impt_target_table_idx
  ON mdx_core.audit_impt USING btree (target_table, target_id, audit_impt_id);
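For clarity, each batch in that scheme looks something like this (the column being backfilled, the value, and the id range are placeholders):

  BEGIN;
  UPDATE mdx_core.audit_impt
     SET source_table = 'legacy_import'
   WHERE audit_impt_id >= 1 AND audit_impt_id < 11;
  -- ... 999 more such statements ...
  COMMIT;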
Re: [PERFORM] Optimizer use of index slows down query by factor
Hi,

Tom Lane wrote:
> I think you need to see about getting this rowcount estimate to be more
> accurate:
>
>  ->  Index Scan using idx_link_1 on link  (cost=0.00..680.51 rows=13477 width=26) (actual time=5.707..12.043 rows=126 loops=1)
>        Index Cond: (task_id = 1556)
>        Filter: (((deletable IS NULL) OR (NOT deletable)) AND ((link_type = 8) OR (link_type = 9)))
>
> If it realized there'd be only 126 rows out of that scan, it'd probably
> have gone for a nestloop join against the big table, which I think would
> be noticeably faster than either of the plans you show here. You already
> did crank up default_statistics_target, so I'm not sure if raising it
> further would help any.

After I increased the statistics target for the specific column on the link table (alter table link alter task_id set statistics 200;), the sql runs fine (< 1 second):

 Limit  (cost=448478.40..448492.17 rows=1 width=30) (actual time=850.698..860.838 rows=12 loops=1)
   ->  GroupAggregate  (cost=448478.40..448492.17 rows=1 width=30) (actual time=850.695..860.824 rows=12 loops=1)
         ->  Sort  (cost=448478.40..448481.15 rows=1100 width=30) (actual time=850.569..853.985 rows=6445 loops=1)
               Sort Key: link.link_type, link.link_alias
               Sort Method:  quicksort  Memory: 696kB
               ->  Nested Loop Left Join  (cost=0.00..448422.84 rows=1100 width=30) (actual time=819.519..838.422 rows=6445 loops=1)
                     ->  Seq Scan on link  (cost=0.00..142722.52 rows=203 width=26) (actual time=819.486..820.016 rows=126 loops=1)
                           Filter: (((deletable IS NULL) OR (NOT deletable)) AND (task_id = 1556) AND ((link_type = 8) OR (link_type = 9)))
                     ->  Index Scan using idx_click_1 on click  (cost=0.00..1370.01 rows=10872 width=12) (actual time=0.003..0.088 rows=51 loops=126)
                           Index Cond: (link.link_id = click.link_id)
 Total runtime: 860.929 ms

> What I'd suggest is trying to avoid using non-independent AND/OR
> conditions. For instance recasting the first OR as just "deletable is not
> true" would probably result in a better estimate. The size of the error
> seems to be more than that would account for though, so I suspect that
> the deletable and link_type conditions are interdependent. Is it
> practical to recast your data representation to avoid that?

I've tried that, but with no positive/negative effects.

Thanks for your help.

Michael
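After changing the target it is also worth confirming what the planner now knows about the column; a quick check, sketched here against the standard pg_stats view:

  ANALYZE link;

  SELECT n_distinct, correlation,
         most_common_vals IS NOT NULL AS has_mcv_list
  FROM   pg_stats
  WHERE  tablename = 'link' AND attname = 'task_id';

A reasonable n_distinct and a populated most-common-values list are what allow the estimate for task_id = 1556 to drop from 13477 toward the real 126.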
Re: [PERFORM] Massive table (500M rows) update nightmare
On Thu, Jan 7, 2010 at 12:17 AM, Carlo Stonebanks stonec.regis...@sympatico.ca wrote:

> Our DB has an audit table which is 500M rows and growing. (FYI the
> objects being audited are grouped semantically, not individual field
> values).
>
> Recently we wanted to add a new feature and we altered the table to add
> a new column. We are backfilling this varchar(255) column by writing a
> TCL script to page through the rows (where every update is a "UPDATE ...
> WHERE id >= x AND id < x+10" and a commit is performed after every 1000
> update statements, i.e. every 10,000 rows.)
>
> We have 10 columns, six of which are indexed. Rough calculations suggest
> that this will take two to three weeks to complete on an 8-core CPU with
> more than enough memory.
>
> As a ballpark estimate - is this sort of performance what one would
> expect of PG for 500M updates given the table structure (detailed below),
> or should I dig deeper to look for performance issues?

Got an explain analyze of the update query?
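Since EXPLAIN ANALYZE actually executes the statement, the safe way to capture one for a single batch is to wrap it in a transaction and roll it back (the column, value, and id range below are placeholders):

  BEGIN;
  EXPLAIN ANALYZE
  UPDATE mdx_core.audit_impt
     SET source_table = 'legacy_import'
   WHERE audit_impt_id >= 1 AND audit_impt_id < 11;
  ROLLBACK;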