Re: [PERFORM] Big question on insert performance/using COPY FROM
Morgan Kita wrote: Hi, I am currently trying to speed up the insertion of bulk loads to my database. I have fiddled with all of the parameters that I have seen suggested (i.e. checkpoint_segments, checkpoint_timeout, maintenance_work_mem, and shared_buffers) with no success. I even turned off fsync with no effect, so I am pretty sure the biggest problem is that the DB is CPU limited at the moment because of the rather weak machine that postmaster is running on (Athlon XP 2400+ with 512MB RAM).

Don't be pretty sure, be absolutely sure. What do your various system-load figures show? Windows has a system performance monitoring tool that can show CPU/memory/disk IO, and on *nix there is vmstat or iostat.

-- Richard Huxton Archonet Ltd

---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
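Since the thread subject is COPY FROM: the usual advice on this list is that a single COPY is far cheaper than row-by-row INSERTs, because the per-statement parse/plan overhead is paid once instead of once per row. A minimal sketch, with an invented table name and file path:

```sql
-- Bulk load in one statement (tab-separated input by default):
COPY orders (id, customer, amount)
FROM '/tmp/orders.tsv';

-- versus paying statement overhead once per row:
-- INSERT INTO orders (id, customer, amount) VALUES (1, 'acme', 9.99);
-- INSERT INTO orders (id, customer, amount) VALUES (2, 'initech', 4.50);
-- ...
```

If the input needs cleaning first, a common pattern is to COPY into an unindexed staging table and then INSERT ... SELECT into the real one.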
[PERFORM] Poor performance on HP Package Cluster
Hi! I've set up a Package Cluster ( Fail-Over Cluster ) on our two HP DL380 G4s with MSA Storage G2 ( Xeon 3.4GHz, 6GB RAM, 2x [EMAIL PROTECTED] RAID1 ). The system is running under SuSE Linux Enterprise Server. My problem is that the performance is very low. On our old server ( Celeron 2GHz with 2GB of RAM ) an import of our data takes about 10 minutes ( 1.1GB of data ). On one of the DL380s it takes more than 90 minutes... Select response times have also increased: Celeron 3 sec, Xeon 30-40 sec. I've been trying to fix the problem for two days now and googled a lot, but I don't know what to do. Top says my CPU spends ~50% of its time in wait I/O:

top - 14:07:34 up 22 min, 3 users, load average: 1.09, 1.04, 0.78
Tasks: 74 total, 3 running, 71 sleeping, 0 stopped, 0 zombie
Cpu(s): 50.0% us, 5.0% sy, 0.0% ni, 0.0% id, 45.0% wa, 0.0% hi, 0.0% si
Mem: 6050356k total, 982004k used, 5068352k free, 60300k buffers
Swap: 2097136k total, 0k used, 2097136k free, 786200k cached

  PID USER     PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
 9939 postgres 18  0  254m 143m 140m R 49.3  2.4 8:35.43 postgres: postgres plate [local] INSERT
 9938 postgres 16  0 13720 1440 1120 S  4.9  0.0 0:59.08 psql -d plate -f dump.sql
10738 root     15  0  3988 1120  840 R  4.9  0.0 0:00.05 top -d 0.2
    1 root     16  0   640  264  216 S  0.0  0.0 0:05.03 init [3]
    2 root     34 19     0    0    0 S  0.0  0.0 0:00.00 [ksoftirqd/0]

vmstat 1:

ClusterNode2 root $ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      0 5032012  60888 821008    0    0   216  6938 1952  5049 40  8 15 37
 0  1      0 5031392  60892 821632    0    0     0  8152 2126  5725 45  6  0 49
 0  1      0 5030896  60900 822144    0    0     0  8124 2052  5731 46  6  0 47
 0  1      0 5030400  60908 822768    0    0     0  8144 2124  5717 44  7  0 50
 1  0      0 5029904  60924 823272    0    0     0  8304 2062  5763 43  7  0 49

I've read (from 2004) that Xeons may have problems with context switching - does the problem still exist? Can I do something to minimize it?
postgresql.conf:

shared_buffers = 28672
effective_cache_size = 40
random_page_cost = 2

shmall and shmmax are set to 268435456

hdparm:

ClusterNode2 root $ hdparm -tT /dev/cciss/c0d0p1
/dev/cciss/c0d0p1:
 Timing buffer-cache reads:  3772 MB in 2.00 seconds = 1885.34 MB/sec
 Timing buffered disk reads:  150 MB in 2.06 seconds =   72.72 MB/sec

greetings Ernst
[PERFORM] Need for speed 3
Hi again, first I want to say ***THANK YOU*** to everyone who kindly shared their thoughts on my hardware problems. I really appreciate it. I have started to look for a new server and I am quite sure we'll get a serious hardware update. As suggested by some people, I would now like to look closer at possible algorithmic improvements. My application basically imports Apache log files into a Postgres database. Every row in the log file gets imported into one of three (raw data) tables. My columns are exactly as in the log file. The import is run approx. every five minutes. We import about two million rows a month. Between 30 and 50 users are using the reporting at the same time. Because reporting became so slow, I created a reporting table. In that table data is aggregated by dropping time (date is preserved), ip, referer and user-agent. And although it breaks normalization, some data from a master table is copied in, so no joins are needed anymore. After every import, the data from the current day is deleted from the reporting table and recalculated from the raw data table. Is this description understandable? If so: What do you think of this approach? Are there better ways to do it? Is there some literature you recommend reading? TIA Ulrich
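Ulrich's delete-and-recalculate step might look roughly like this in SQL. All table and column names here are invented for illustration, since the actual schema wasn't posted:

```sql
-- Rebuild today's slice of the reporting table from the raw data.
BEGIN;
DELETE FROM report
 WHERE day = current_date;

INSERT INTO report (day, url, status, hits)
SELECT date_trunc('day', ts)::date, url, status, count(*)
  FROM raw_log
 WHERE ts >= current_date
 GROUP BY 1, 2, 3;    -- time-of-day, ip, referer, user-agent dropped
COMMIT;
```

Wrapping both statements in one transaction means readers never see the reporting table with today's rows missing.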
Re: [PERFORM] Poor performance on HP Package Cluster
Your HD raw IO rate seems fine, so the problem is not likely to be with the HDs. That consistent ~10x increase in how long it takes to do an import or a select is noteworthy. This smells like an interconnect problem. Was the Celeron locally connected to the HDs while the new Xeons are network connected? Getting tens or even hundreds of MBps of throughput out of local storage is much easier than it is to do over a network. 1GbE is required if you want HDs to push 72.72MBps over a network, and not even one 10GbE line will allow you to match local buffered IO of 1885.34MBps. What size are those network connects (Server A - storage, Server B - storage, Server A - Server B)? Ron Peacetree

At 10:16 AM 9/1/2005, Ernst Einstein wrote: I've set up a Package Cluster ( Fail-Over Cluster ) on our two HP DL380 G4s with MSA Storage G2 ( Xeon 3.4GHz, 6GB RAM, 2x [EMAIL PROTECTED] RAID1 ). The system is running under SuSE Linux Enterprise Server. My problem is that the performance is very low. On our old server ( Celeron 2GHz with 2GB of RAM ) an import of our data takes about 10 minutes ( 1.1GB of data ). On one of the DL380s it takes more than 90 minutes... Select response times have also increased: Celeron 3 sec, Xeon 30-40 sec. I've been trying to fix the problem for two days now and googled a lot, but I don't know what to do. Top says my CPU spends ~50% of its time in wait I/O.
[quoted top, vmstat, postgresql.conf and hdparm output snipped - duplicate of the original message above]
Re: [PERFORM] Need for speed 3
Ulrich wrote: [description of the Apache log import and reporting-table aggregation snipped - see the original message above]

Schemas would be helpful. You may be able to tweak the import table a bit and how it moves over to the data tables. Just a thought: have you considered having Apache log to a process that immediately makes insert queries to PostgreSQL? You could write a small C program which uses the advanced query interface calls to the server. Merlin
Re: [PERFORM] Need for speed 3
Hi Merlin, "schemas would be helpful" - right now I would like to know if my approach to the problem makes sense, or if I should rework the whole procedure of import and aggregation. "Just a thought: have you considered having Apache log to a process that immediately makes insert queries to PostgreSQL?" Yes, we have considered that, but dismissed the idea very soon. We need Apache to be as responsive as possible. It's a two-server setup with load balancer and failover, serving about one thousand domains and counting. It needs to be as failsafe as possible and under no circumstances can any request be lost. (The click counting is core business and relates directly to our income.) That said, it seemed quite safe to let Apache write logfiles and import them later; that way a database downtime wouldn't be mission critical. "You could write a small C program which uses the advanced query interface calls to the server." How would that improve performance? Ulrich
Re: [PERFORM] Need for speed 3
"Just a thought: have you considered having Apache log to a process that immediately makes insert queries to PostgreSQL?" "Yes we have considered that, but dismissed the idea very soon. We need Apache to be as responsive as possible. [...] under no circumstances can any request be lost." [rest of Ulrich's reasons snipped - see his message above]

hm. Well, it may be possible to do this in a fast and safe way, and I understand your reservations here, but I'm going to spout off my opinion anyways :). If you are not doing this, the following point is moot. But take into consideration that you could set a very low transaction timeout (like 0.25 seconds) and siphon log entries off to a text file if your database server gets in trouble. Two million hits a month is not very high even if your traffic is bursty (there are approx. 2.5 million seconds in a month). With a directly linked log you get up-to-date stats always and spare yourself the dump/load song and dance, which is always a headache :(. Also, however you are doing your billing, it will be easier to manage if everything is extracted from pg and not some conglomeration of log files, *if* you can put 100% faith in your database. When it comes to pg now, I'm a believer.

"You could write a small C program which uses the advanced query interface calls to the server." "How would that improve performance?" The functions I'm talking about are PQexecParams and PQexecPrepared. The query string does not need to be encoded or decoded, which is very light on server resources and very low latency. Using them you could probably get 5000 inserts/sec on a cheap server with low CPU load, if you have some type of write caching in place.
Merlin
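The saving Merlin attributes to PQexecPrepared can be sketched at the SQL level with PREPARE/EXECUTE, which drives the same server-side mechanism (the statement is parsed and planned once, then executed repeatedly). The table and parameter names here are hypothetical:

```sql
-- Parse and plan the insert once:
PREPARE log_insert (timestamptz, text, int) AS
  INSERT INTO raw_log (ts, url, status) VALUES ($1, $2, $3);

-- Each execution only binds parameters, no re-parsing:
EXECUTE log_insert (now(), '/index.html', 200);
EXECUTE log_insert (now(), '/faq.html', 304);
```

PQexecPrepared additionally sends the parameters over the wire out-of-line (optionally in binary), which is where the encode/decode saving Merlin mentions comes from.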
Re: [PERFORM] Need for speed 3
Ulrich, On 9/1/05 6:25 AM, Ulrich Wisser [EMAIL PROTECTED] wrote: "My application basically imports Apache log files into a Postgres database. Every row in the log file gets imported into one of three (raw data) tables. My columns are exactly as in the log file. The import is run approx. every five minutes. We import about two million rows a month." Bizgres Clickstream does this job using an ETL (extract, transform and load) process to transform the weblogs into an optimized schema for reporting. "After every import the data from the current day is deleted from the reporting table and recalculated from the raw data table." This is something the optimized ETL in Bizgres Clickstream also does well. "What do you think of this approach? Are there better ways to do it? Is there some literature you recommend reading?" I recommend the Bizgres Clickstream docs; you can get them from Bizgres CVS, and there will shortly be a live HTML link on the website. Bizgres is free - it also improves COPY performance by almost 2x, among other enhancements. - Luke
[PERFORM] Massive performance issues
Hi, I'm having performance issues with a table consisting of 2,043,133 rows. The schema is:

\d address
            Table "public.address"
     Column      |          Type          | Modifiers
-----------------+------------------------+-----------
 postcode_top    | character varying(2)   | not null
 postcode_middle | character varying(4)   | not null
 postcode_bottom | character varying(7)   | not null
 postcode        | character varying(10)  | not null
 property_type   | character varying(15)  | not null
 sale_type       | character varying(10)  | not null
 flat_extra      | character varying(100) | not null
 number          | character varying(100) | not null
 street          | character varying(100) | not null
 locality_1      | character varying(100) | not null
 locality_2      | character varying(100) | not null
 city            | character varying(100) | not null
 county          | character varying(100) | not null
Indexes:
    address_city_index btree (city)
    address_county_index btree (county)
    address_locality_1_index btree (locality_1)
    address_locality_2_index btree (locality_2)
    address_pc_bottom_index btree (postcode_bottom)
    address_pc_middle_index btree (postcode_middle)
    address_pc_top_index btree (postcode_top)
    address_pc_top_middle_bottom_index btree (postcode_top, postcode_middle, postcode_bottom)
    address_pc_top_middle_index btree (postcode_top, postcode_middle)
    address_postcode_index btree (postcode)
    address_property_type_index btree (property_type)
    address_street_index btree (street)
    street_prefix btree (lower(substring((street)::text, 1, 1)))

This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a SATA harddrive. Queries such as:

select locality_2 from address where locality_2 = 'Manchester';

are taking 14 seconds to complete, and this is only 2 years' worth of data - we will have up to 15 years (so over 15 million rows).
Interestingly, doing:

explain select locality_2 from address where locality_2 = 'Manchester';

gives

                          QUERY PLAN
---------------------------------------------------------------
 Seq Scan on address  (cost=0.00..80677.16 rows=27923 width=12)
   Filter: ((locality_2)::text = 'Manchester'::text)

but:

explain select locality_1 from address where locality_1 = 'Manchester';

gives

                          QUERY PLAN
---------------------------------------------------------------
 Index Scan using address_locality_1_index on address  (cost=0.00..69882.18 rows=17708 width=13)
   Index Cond: ((locality_1)::text = 'Manchester'::text)

Sadly, using the index makes things worse, the query taking 17 seconds. locality_1 has 16650 distinct values and locality_2 has 1156 distinct values. Whilst the locality_2 query is in progress, both the disk and the CPU are maxed out, with the disk constantly reading at 60MB/s and the CPU rarely dropping under 100% load. With the locality_1 query in progress, the CPU is maxed out but the disk is reading at just 3MB/s. Obviously, to me, this is a problem. I need these queries to complete in under a second. Is this unreasonable? What can I do to make this go faster? I've considered normalising the table, but I can't work out whether the slowness is in dereferencing the pointers from the index into the table or in scanning the index in the first place. And normalising the table is going to cause much pain when inserting values, and I'm not entirely sure I see why normalising it should cause a massive performance improvement. I need to get to the stage where I can run queries such as:

select street, locality_1, locality_2, city from address
where (city = 'Nottingham' or locality_2 = 'Nottingham'
       or locality_1 = 'Nottingham')
  and upper(substring(street from 1 for 1)) = 'A'
group by street, locality_1, locality_2, city
order by street
limit 20 offset 0

and have the results very quickly. Any help most gratefully received (even if it's to say that I should be posting to a different mailing list!). Many thanks, Matthew
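One observation on the target query above: the table already has an expression index on lower(substring(street, 1, 1)), but the query filters on the upper() form, which that index cannot serve. PostgreSQL 7.4 supports indexes on arbitrary expressions, so an index matching the predicate exactly might help (sketch only; the index name is invented):

```sql
-- Match the query's predicate upper(substring(street from 1 for 1)) = 'A'
CREATE INDEX street_prefix_upper
    ON address (upper(substring(street from 1 for 1)));

ANALYZE address;   -- refresh statistics so the planner considers it
```

Alternatively, rewriting the query to use the existing lower()-based index (comparing against 'a') would avoid the extra index entirely.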
Re: [PERFORM] Massive performance issues
"I'm having performance issues with a table consisting of 2,043,133 rows." "locality_1 has 16650 distinct values and locality_2 has 1156 distinct values." Just so you know: I have a 2GHz P4 workstation with a similar-sized table (2M rows), several keys, and can find and fetch 2k rows based on a 20k-distinct-value key in about 60 ms (0.06 seconds). Merlin
Re: [PERFORM] Massive performance issues
"Table public.address
 postcode_top    | character varying(2) | not null
 postcode_middle | character varying(4) | not null
 postcode_bottom | character varying(7) | not null"

Consider making the above fields char(x), not varchar(x), for small but important savings.

[remaining columns and index list snipped - see the original message above]

"Obviously, to me, this is a problem, I need these queries to be under a second to complete. Is this unreasonable? What can I do to make this go faster? I've considered normalising the table but I can't work out whether the slowness is in dereferencing the pointers from the index into the table or in scanning the index in the first place. And normalising the table is going to cause much pain when inserting values and I'm not entirely sure if I see why normalising it should cause a massive performance improvement."
http://www.dbdebunk.com :)

"I need to get to the stage where I can run queries such as: [query snipped - see the original message above] and have the results very quickly. Any help most gratefully received (even if it's to say that I should be posting to a different mailing list!)."

This is the correct list. Did you run vacuum/analyze, etc.? Please post vacuum analyze times. Merlin
Re: [PERFORM] Massive performance issues
On Thu, Sep 01, 2005 at 02:47:06PM -0400, Tom Lane wrote: "Matthew Sackman [EMAIL PROTECTED] writes: Obviously, to me, this is a problem, I need these queries to be under a second to complete. Is this unreasonable? Yes. Pulling twenty thousand rows at random from a table isn't free." I appreciate that. But I'm surprised by how un-free it seems to be. And it seems others here get the performance I need on similar hardware. "You were pretty vague about your disk hardware, which makes me think you didn't spend a lot of money on it ... and on low-ball hardware, that sort of random access speed just isn't gonna happen." Well, this is a development box, but the live box wouldn't be much more than RAID 1 on 10k SCSI disks, so that should only be a halving of seek time, not the 1000-times reduction I'm after! In fact, now I think about it, I have been testing on a 2.4 kernel on a dual HT 3GHz Xeon with a SCSI RAID array, and the performance is only marginally better. "If the queries you need are very consistent, you might be able to get some mileage out of CLUSTERing by the relevant index ... but the number of indexes you've created makes me think that's not so ..." No, the queries, whilst in just three distinct forms, will effectively be for fairly random values. Matthew
Re: [PERFORM] Massive performance issues
On Thu, Sep 01, 2005 at 02:04:54PM -0400, Merlin Moncure wrote: "did you run vacuum/analyze, etc? Please post vacuum analyze times."

2005-09-01 19:47:08 LOG: statement: vacuum full analyze address;
2005-09-01 19:48:44 LOG: duration: 96182.777 ms
2005-09-01 19:50:20 LOG: statement: vacuum analyze address;
2005-09-01 19:51:48 LOG: duration: 87675.268 ms

I run them regularly, pretty much after every bulk import. Matthew
Re: [PERFORM] Massive performance issues
Matthew Sackman [EMAIL PROTECTED] writes: Obviously, to me, this is a problem, I need these queries to be under a second to complete. Is this unreasonable? Yes. Pulling twenty thousand rows at random from a table isn't free. You were pretty vague about your disk hardware, which makes me think you didn't spend a lot of money on it ... and on low-ball hardware, that sort of random access speed just isn't gonna happen. If the queries you need are very consistent, you might be able to get some mileage out of CLUSTERing by the relevant index ... but the number of indexes you've created makes me think that's not so ... regards, tom lane
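For reference, Tom's CLUSTER suggestion applied to one of the posted indexes would look like this (PostgreSQL 7.4 syntax; which index to cluster on is a judgment call):

```sql
-- Physically reorder the table by the locality_2 index, so rows the
-- index groups together are read sequentially instead of randomly.
-- CLUSTER rewrites the whole table and takes an exclusive lock, and the
-- ordering degrades as new rows are inserted, so it must be re-run.
CLUSTER address_locality_2_index ON address;

ANALYZE address;   -- refresh planner statistics after the rewrite
```

As Tom notes, this only pays off for the one access pattern the chosen index serves; the other twelve indexes see no benefit.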
Re: [PERFORM] Massive performance issues
Any chance it's a vacuum thing? Or configuration (out of the box it needs adjusting)? Joel Fradkin

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Merlin Moncure Sent: Thursday, September 01, 2005 2:11 PM To: Matthew Sackman Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] Massive performance issues

[quoted message snipped - duplicate of Merlin's message above]
Re: [PERFORM] Massive performance issues
Matthew Sackman schrieb: [quoted schema, query plans and problem description snipped - duplicate of the original message above]

Just an idea: when you do not want to adapt your application to use a normalized database, you may push the data into normalized tables using triggers. Example: add a table city with columns id and name, and add a column city_id to your main table. In this case you have redundant data in your main table (locality_1 and city_id), but you could query the city table when searching for 'Man%'. -- Best Regards / Viele Grüße Sebastian Hennebrueder http://www.laliluna.de Tutorials for JSP, JavaServer Faces, Struts, Hibernate and EJB. Get support, education and consulting for these technologies.
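Sebastian's trigger idea could be sketched like this. All names are invented, the function body uses the quoted-string style PostgreSQL 7.4 requires, plpgsql is assumed to be installed in the database, and the lookup ignores concurrent-insert races:

```sql
CREATE TABLE city (id serial PRIMARY KEY, name varchar(100) UNIQUE);

ALTER TABLE address ADD COLUMN city_id integer;

CREATE FUNCTION address_set_city_id() RETURNS trigger AS '
BEGIN
  -- add the city to the lookup table if it is not there yet
  INSERT INTO city (name)
    SELECT NEW.city WHERE NOT EXISTS
      (SELECT 1 FROM city WHERE name = NEW.city);
  -- stamp the row with the normalized id
  SELECT INTO NEW.city_id id FROM city WHERE name = NEW.city;
  RETURN NEW;
END;
' LANGUAGE plpgsql;

CREATE TRIGGER address_city_trg BEFORE INSERT OR UPDATE ON address
  FOR EACH ROW EXECUTE PROCEDURE address_set_city_id();
```

Queries like the 'Man%' search can then scan the small city table first and join back on city_id.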
Re: [PERFORM] Massive performance issues
On Thu, Sep 01, 2005 at 02:04:54PM -0400, Merlin Moncure wrote: Table public.address Column| Type | Modifiers --++--- postcode_top | character varying(2) | not null postcode_middle | character varying(4) | not null postcode_bottom | character varying(7) | not null consider making above fields char(x) not varchar(x) for small but important savings. Huh, hang on -- AFAIK there's no saving at all by doing that. Quite the opposite really, because with char(x) you store the padding blanks, which are omitted with varchar(x), so less I/O (not necessarily a measurable amount, mind you, maybe even zero because of padding issues.) -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com You liked Linux a lot when he was just the gawky kid from down the block mowing your lawn or shoveling the snow. But now that he wants to date your daughter, you're not so sure he measures up. (Larry Greenemeier)
Re: [PERFORM] Poor performance on HP Package Cluster
Are you using the built-in HP SmartArray RAID/SCSI controllers? If so, that could be your problem: they are known to have terrible and variable performance with Linux. The only good fix is to add a simple SCSI controller to your system (HP sells them) and stay away from the hardware RAID. - Luke

On 9/1/05 7:16 AM, Ernst Einstein [EMAIL PROTECTED] wrote: [original message with top/vmstat output and configuration quoted in full snipped - see above]
Re: [PERFORM] Massive performance issues
On 1-9-2005 19:42, Matthew Sackman wrote: Obviously, to me, this is a problem, I need these queries to be under a second to complete. Is this unreasonable? What can I do to make this go faster? I've considered normalising the table but I can't work out whether the slowness is in dereferencing the pointers from the index into the table or in scanning the index in the first place. And normalising the table is going to cause much pain when inserting values and I'm not entirely sure if I see why normalising it should cause a massive performance improvement. In this case, I think normalising will give a major decrease in the on-disk size of this large table and the indexes you have. If that's the case, that in itself will speed up all I/O-bound queries quite a bit. locality_1, _2, city and county can probably be normalised away without much problem, and going from varchars to integers will probably save you quite a bit of (disk) space. It won't change the selectivity of the indexes, though, so you won't get more index scans instead of sequential scans, I suppose. I think it's not that hard to create a normalized set of tables from this data set (using "insert into tablename select distinct ... from address" and such, then "insert into address_new (..., city) select ..., (select cityid from cities where city = address.city) from address"). So it's at least relatively easy to measure the performance improvement from normalizing the dataset a bit. If you want to improve your hardware, have a look at the Western Digital Raptor-series SATA disks; they are fast, SCSI-like SATA drives. You may also have a look at the amount of memory available, to allow caching this (entire) table. Best regards, Arjen ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
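A minimal sketch of the normalization Arjen describes, for the `city` column only. The `cities` lookup table, the `city_id` column and the `address_new` name are illustrative, not part of the original schema; it needs a live PostgreSQL server to run:

```sql
-- Build a lookup table of distinct city names (hypothetical names).
CREATE TABLE cities (
    city_id serial PRIMARY KEY,
    city    varchar(100) NOT NULL UNIQUE
);
INSERT INTO cities (city) SELECT DISTINCT city FROM address;

-- Rebuild the address table with an integer key in place of the varchar.
-- (...remaining columns elided; repeat the pattern for county, locality_1, etc.)
CREATE TABLE address_new AS
    SELECT a.postcode, a.street, c.city_id
    FROM address a JOIN cities c ON c.city = a.city;
CREATE INDEX address_new_city_id_index ON address_new (city_id);
```

Comparing table and index sizes before and after (and re-running the slow queries) then measures the improvement Arjen is talking about.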
Re: [PERFORM] Massive performance issues
Ron [EMAIL PROTECTED] writes: ... Your target is to have each row take <= 512B. Ron, are you assuming that the varchar fields are blank-padded or something? I think it's highly unlikely that he's got more than a couple hundred bytes per row right now --- at least if the data is what it sounds like. The upthread comment about strcoll() set off some alarm bells in my head. If the database wasn't initdb'd in C locale already, try making it so. Also, use a single-byte encoding if you can (LatinX is fine, Unicode not). Upgrade pg to 8.0.3 and make sure you have enough RAM for your real day to day load. Newer PG definitely better. Some attention to the configuration parameters might also be called for. I fear though that these things are probably just chipping at the margins ... regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
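Tom's locale point can be illustrated outside the database: in the C locale, strcoll() degenerates to a plain byte comparison, which is why a C-locale initdb makes string compares and index builds cheaper. A small Python sketch of the idea (illustrative only, not PostgreSQL code):

```python
import locale
from functools import cmp_to_key

# In the C locale, strcoll() is effectively memcmp(): cheap, byte-ordered.
locale.setlocale(locale.LC_COLLATE, "C")
words = ["Banana", "Zebra", "apple"]

by_bytes = sorted(words)                                    # plain byte order
by_strcoll = sorted(words, key=cmp_to_key(locale.strcoll))  # C-locale collation

# The two orders agree. A language locale such as en_GB would instead
# interleave cases ("apple" before "Banana") and pay a far more
# expensive comparison for every pair - Tom's strcoll() alarm bell.
print(by_bytes == by_strcoll)
```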
Re: [PERFORM] Massive performance issues
This should be able to run _very_ fast. At 01:42 PM 9/1/2005, Matthew Sackman wrote: Hi, I'm having performance issues with a table consisting of 2,043,133 rows. The schema is:

\d address
Table public.address
     Column      |          Type          | Modifiers
-----------------+------------------------+-----------
 postcode_top    | character varying(2)   | not null
 postcode_middle | character varying(4)   | not null
 postcode_bottom | character varying(7)   | not null
 postcode        | character varying(10)  | not null
 property_type   | character varying(15)  | not null
 sale_type       | character varying(10)  | not null
 flat_extra      | character varying(100) | not null
 number          | character varying(100) | not null
 street          | character varying(100) | not null
 locality_1      | character varying(100) | not null
 locality_2      | character varying(100) | not null
 city            | character varying(100) | not null
 county          | character varying(100) | not null
Indexes:
 address_city_index btree (city)
 address_county_index btree (county)
 address_locality_1_index btree (locality_1)
 address_locality_2_index btree (locality_2)
 address_pc_bottom_index btree (postcode_bottom)
 address_pc_middle_index btree (postcode_middle)
 address_pc_top_index btree (postcode_top)
 address_pc_top_middle_bottom_index btree (postcode_top, postcode_middle, postcode_bottom)
 address_pc_top_middle_index btree (postcode_top, postcode_middle)
 address_postcode_index btree (postcode)
 address_property_type_index btree (property_type)
 address_street_index btree (street)
 street_prefix btree (lower(substring((street)::text, 1, 1)))

IOW, each row takes ~1KB on HD. First suggestion: format your HD to use 8KB pages with 1KB segments. That'll put each row down on HD as an atomic unit. 8KB pages also play nice with pg. At 1KB per row, this table takes up ~2.1GB and should fit into RAM fairly easily on a decently configured DB server (my _laptop_ has 2GB of RAM, after all...). Since you are using ~2.1GB for 2 years' worth of data, 15 years' worth should take no more than 2.1GB*7.5 = 15.75GB.
If you replace some of those 100 char fields with integers for code numbers and have an auxiliary table for each of those fields mapping the code numbers to the associated 100 char string, you should be able to shrink a row considerably. Your target is to have each row take <= 512B. Once a row fits into one 512B sector on HD, there's no point in making it smaller unless you can shrink it enough to fit 2 rows into one sector (<= 256B each). Once two rows fit into one sector, there's no point shrinking a row unless you can make 3 rows fit into a sector. Etc. Assuming each 100 char (eg 100B) field can be replaced with a 4B int, each row could be as small as 76B. That makes 85B per row the goal, as it would allow you to fit 6 rows per 512B HD sector. So in the best case your table will be 12x smaller in terms of real HD space. Fitting one (or more) row(s) into one sector will cut down the real space used on HD for the table to ~7.88GB (or 1.32GB in the best case). Any such streamlining will make it faster to load, make the working set that needs to be in RAM for best performance smaller, etc, etc. This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a SATA harddrive. Upgrade pg to 8.0.3 and make sure you have enough RAM for your real day to day load. Frankly, RAM is so cheap ($75-$150/GB), I'd just upgrade the machine to 4GB as a matter of course. P4's have PAE, so if your mainboard can hold it, put more than 4GB of RAM in if you find you need it. Since you are describing your workload as being predominantly reads, you can get away with far less HD capability as long as you crank up RAM high enough to hold the working set of the DB. The indications from the OP are that you may very well be able to hold the entire DB in RAM. That's a big win whenever you can achieve it. After these steps, there may still be performance issues that need attention, but the DBMS should be _much_ faster. Ron Peacetree ---(end of broadcast)--- TIP 6: explain analyze is your friend
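Ron's sector arithmetic, worked through (Python is used purely as a calculator here; the 512B sector, the 4B-int substitution and the 12x best case are his assumptions, not measurements):

```python
SECTOR = 512        # bytes per HD sector (Ron's assumption)
ROW_NOW = 1024      # ~1KB per row as laid out today
ROW_GOAL = 85       # target row size after replacing varchars with ints

# How many target-size rows fit in one sector: 512 // 85 = 6,
# which is why 85B (not the minimal 76B) is the practical goal.
rows_per_sector = SECTOR // ROW_GOAL

# Current footprint: ~2.1GB (decimal GB, matching the post's figure).
table_now_gb = 2_043_133 * ROW_NOW / 1e9

# Best case: "12x smaller" applied to the 15-year projection of 15.75GB.
table_best_gb = 15.75 / 12

print(rows_per_sector, round(table_now_gb, 2), round(table_best_gb, 2))
```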
Re: [PERFORM] Massive performance issues
At 04:25 PM 9/1/2005, Tom Lane wrote: Ron [EMAIL PROTECTED] writes: ... Your target is to have each row take <= 512B. Ron, are you assuming that the varchar fields are blank-padded or something? I think it's highly unlikely that he's got more than a couple hundred bytes per row right now --- at least if the data is what it sounds like. As it stands, each row will take 55B - 748B, and each field is variable in size up to the maximums given in the OP's schema. Since pg uses an underlying OS FS, and not a native one, there will be extra FS overhead no matter what we do, particularly to accommodate such flexibility... The goal is to minimize overhead and maximize regularity in layout. The recipe I know for HD IO speed is keeping the data small, regular, and as simple as possible. Even better, if the table(s) can be made RAM resident, then searches, even random ones, can be very fast. He wants a 1000x performance improvement. Going from disk resident to RAM resident should help greatly in attaining that goal. In addition, by replacing as many variable-sized text strings as possible with ints, the actual compare functions he used as examples should run faster as well. The upthread comment about strcoll() set off some alarm bells in my head. If the database wasn't initdb'd in C locale already, try making it so. Also, use a single-byte encoding if you can (LatinX is fine, Unicode not). Good thoughts I hadn't had. Upgrade pg to 8.0.3 and make sure you have enough RAM for your real day to day load. Newer PG definitely better. Some attention to the configuration parameters might also be called for. I fear though that these things are probably just chipping at the margins ... I don't expect 8.0.3 to be a major performance improvement.
I do expect it to be a major _maintenance_ improvement for both him and those of us trying to help him ;-) The performance difference between not having the working set of the DB fit into RAM during ordinary operation vs having it be so (or better, having the whole DB fit into RAM during ordinary operation) has been considerably more effective than chipping at the margins IME. Especially so if the HD IO subsystem is wimpy. Ron Peacetree ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Massive performance issues
On Thu, Sep 01, 2005 at 10:09:30PM +0200, Steinar H. Gunderson wrote: address_city_index btree (city) address_county_index btree (county) address_locality_1_index btree (locality_1) address_locality_2_index btree (locality_2) address_pc_bottom_index btree (postcode_bottom) address_pc_middle_index btree (postcode_middle) address_pc_top_index btree (postcode_top) address_pc_top_middle_bottom_index btree (postcode_top, postcode_middle, postcode_bottom) address_pc_top_middle_index btree (postcode_top, postcode_middle) address_postcode_index btree (postcode) address_property_type_index btree (property_type) address_street_index btree (street) street_prefix btree (lower(substring((street)::text, 1, 1))) Wow, that's quite a lot of indexes... but your problem isn't reported as being in insert/update/delete. Hah, well now that you mention it. Basically, 100,000 rows arrive in a bulk import every month, and the only way I can get it to complete in any sane time frame at all is to drop the indexes, do the import and then recreate the indexes. But that's something that I'm OK with - the imports don't have to be that fast and whilst important, it's not *the* critical path. Selection from the database is, hence the indexes. This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a SATA harddrive. 8.0 or 8.1 might help you some -- better (and more!) disks will probably help a _lot_. Ok, I did try 8.0 when I started this and found that server-side bind parameters (both via DBD::Pg (with pg_prepare_server = 1) and via JDBC (various versions I tried)) failed - the parameters were clearly not being substituted. This was Postgresql 8.0 from Debian unstable. That was a couple of weeks ago and I've not been back to check whether it's been fixed. Anyway, because of these problems I dropped back to 7.4.
Queries such as: select locality_2 from address where locality_2 = 'Manchester'; are taking 14 seconds to complete, and this is only 2 years worth of data - we will have up to 15 years (so over 15 million rows). As Tom pointed out; you're effectively doing random searches here, and using CLUSTER might help. Normalizing your data to get smaller rows (and avoid possibly costly string comparisons if your strcoll() is slow) will probably also help. Ok, so you're saying that joining the address table into an address_city table (the obvious normalization) will help here? The locale settings in postgresql.conf all have en_GB and a \l shows encoding of LATIN1. So I don't think I've set anything to UTF8 or such like. I need to get to the stage where I can run queries such as: select street, locality_1, locality_2, city from address where (city = 'Nottingham' or locality_2 = 'Nottingham' or locality_1 = 'Nottingham') and upper(substring(street from 1 for 1)) = 'A' group by street, locality_1, locality_2, city order by street limit 20 offset 0 This might be a lot quicker than pulling all the records like in your example queries... Yes, that certainly does seem to be the case - around 4 seconds. But I need it to be 10 times faster (or thereabouts) otherwise I have big problems! Many thanks for all the advice so far. Matthew ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
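The drop-then-recreate import pattern Matthew describes can be sketched as below. The file path is a placeholder, only two of the thirteen indexes are shown, and COPY's default tab-delimited format is assumed (7.4 has no CSV mode); treat the whole thing as illustrative:

```sql
BEGIN;
DROP INDEX address_city_index;      -- repeat for the other indexes
DROP INDEX address_street_index;

-- Bulk load; COPY is much faster than row-by-row INSERTs.
COPY address FROM '/path/to/monthly_import.dat';

-- Recreate indexes in one pass each over the full table.
CREATE INDEX address_city_index ON address (city);
CREATE INDEX address_street_index ON address (street);
COMMIT;

ANALYZE address;                    -- refresh planner statistics
```

Rebuilding each index once over the whole table is far cheaper than maintaining all thirteen incrementally across 100,000 inserts, which is why this is the standard bulk-load practice.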
Re: [PERFORM] Massive performance issues
On Thu, Sep 01, 2005 at 10:54:45PM +0200, Arjen van der Meijden wrote: On 1-9-2005 19:42, Matthew Sackman wrote: Obviously, to me, this is a problem, I need these queries to be under a second to complete. Is this unreasonable? What can I do to make this go faster? I've considered normalising the table but I can't work out whether the slowness is in dereferencing the pointers from the index into the table or in scanning the index in the first place. And normalising the table is going to cause much pain when inserting values and I'm not entirely sure if I see why normalising it should cause a massive performance improvement. In this case, I think normalising will give a major decrease in the on-disk size of this large table and the indexes you have. If that's the case, that in itself will speed up all I/O-bound queries quite a bit. Well that's the thing - on the queries where it decides to use the index it only reads at around 3MB/s and the CPU is maxed out, whereas when it doesn't use the index, the disk is being read at 60MB/s. So when it decides to use an index, I don't seem to be IO bound at all. Or at least that's the way it seems to me. locality_1, _2, city and county can probably be normalised away without much problem, and going from varchars to integers will probably save you quite a bit of (disk) space. Sure, that's what I've been considering today. It won't change the selectivity of the indexes, though, so you won't get more index scans instead of sequential scans, I suppose. I think it's not that hard to create a normalized set of tables from this data set (using "insert into tablename select distinct ... from address" and such, then "insert into address_new (..., city) select ..., (select cityid from cities where city = address.city) from address"). So it's at least relatively easy to measure the performance improvement from normalizing the dataset a bit. Yeah, the initial creation isn't too painful but when adding rows into the address table it gets more painful.
However, as I've said elsewhere, the import isn't the critical path so I can cope with that pain, possibly coding around it in a stored procedure and triggers as suggested. If you want to improve your hardware, have a look at the Western Digital Raptor-series SATA disks; they are fast, SCSI-like SATA drives. You may also have a look at the amount of memory available, to allow caching this (entire) table. Well I've got 1GB of RAM, but from analysis of its use, a fair amount isn't being used. About 50% is actually in use by applications, about half of the rest is cache, and the rest isn't being used. Has this to do with the max_fsm_pages and max_fsm_relations settings? I've pretty much not touched the configuration and it's the standard Debian package. Matthew ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [PERFORM] Massive performance issues
Well I've got 1GB of RAM, but from analysis of its use, a fair amount isn't being used. About 50% is actually in use by applications and about half of the rest is cache and the rest isn't being used. Has this to do with the max_fsm_pages and max_fsm_relations settings? I've pretty much not touched the configuration and it's the standard Debian package. Matt, have a look at the annotated postgresql.conf for 7.x here: http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html If you have the default settings, you're likely hampering yourself quite a bit. You probably care about shared_buffers, sort_mem, vacuum_mem, max_fsm_pages, effective_cache_size Also, you may want to read the PostgreSQL 8.0 Performance Checklist. Even though it's for 8.0, it'll give you good ideas on what to change in 7.4. You can find it here: http://www.powerpostgresql.com/PerfList/ -- Jeff Frost, Owner [EMAIL PROTECTED] Frost Consulting, LLC http://www.frostconsultingllc.com/ Phone: 650-780-7908 FAX: 650-649-1954 ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
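For a 7.4 box with 1GB of RAM, a starting point for the parameters Jeff lists might look like the fragment below. The values are illustrative ballparks to tune from (note 7.4 counts shared_buffers and effective_cache_size in 8KB pages, and sort_mem/vacuum_mem in KB), not recommendations for this exact workload:

```
# postgresql.conf (7.4) - illustrative starting values for a 1GB-RAM server
shared_buffers = 8192          # 8KB pages, so 64MB; the Debian default is tiny
sort_mem = 8192                # 8MB per sort - watch concurrent sorts
vacuum_mem = 32768             # 32MB, speeds up VACUUM
max_fsm_pages = 50000          # raise if VACUUM VERBOSE reports more pages
effective_cache_size = 65536   # 8KB pages, ~512MB: tell the planner about the OS cache
```

The annotated postgresql.conf Jeff links to explains each of these in detail; EXPLAIN ANALYZE before and after is the way to confirm any change helps.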
Re: [PERFORM] Massive performance issues
On Thu, Sep 01, 2005 at 10:13:59PM +0100, Matthew Sackman wrote: Well that's the thing - on the queries where it decides to use the index it only reads at around 3MB/s and the CPU is maxed out, whereas when it doesn't use the index, the disk is being read at 60MB/s. So when it decides to use an index, I don't seem to be IO bound at all. Or at least that's the way it seems to me. You are I/O bound; your disk is doing lots and lots of seeks. The SATA interface is not the bottleneck; the disk's ability to rotate and move its heads is. /* Steinar */ -- Homepage: http://www.sesse.net/ ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
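Steinar's point can be quantified from Matthew's own figures: at PostgreSQL's 8KB page size, 3MB/s of index-driven reads is only a few hundred page fetches per second - right in the range a single SATA disk can seek - while 60MB/s is its sequential ceiling. A back-of-envelope check (the throughput numbers are Matthew's observations; the seek-rate framing is an order-of-magnitude estimate):

```python
PAGE = 8 * 1024                  # PostgreSQL page size in bytes
observed_random = 3 * 10**6      # ~3MB/s when the index is used
observed_seq = 60 * 10**6        # ~60MB/s on a sequential scan

# Random page fetches per second implied by the index-scan throughput:
pages_per_sec = observed_random // PAGE   # a few hundred seeks/s

# Apparent slowdown of random vs sequential access on the same disk:
slowdown = observed_seq // observed_random

print(pages_per_sec, slowdown)
```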
Re: [PERFORM] Massive performance issues
On Thu, Sep 01, 2005 at 02:26:47PM -0700, Jeff Frost wrote: Well I've got 1GB of RAM, but from analysis of its use, a fair amount isn't being used. About 50% is actually in use by applications and about half of the rest is cache and the rest isn't being used. Has this to do with the max_fsm_pages and max_fsm_relations settings? I've pretty much not touched the configuration and it's the standard Debian package. Matt, have a look at the annotated postgresql.conf for 7.x here: http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html If you have the default settings, you're likely hampering yourself quite a bit. You probably care about shared_buffers, sort_mem, vacuum_mem, max_fsm_pages, effective_cache_size That's a useful resource, thanks for the pointer. I'll work through that tomorrow. Also, you may want to read the PostgreSQL 8.0 Performance Checklist. Even though it's for 8.0, it'll give you good ideas on what to change in 7.4. You can find it here: http://www.powerpostgresql.com/PerfList/ Thanks, another good resource. I'll work through that too. Matthew ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [PERFORM] Massive performance issues
At 05:06 PM 9/1/2005, Matthew Sackman wrote: On Thu, Sep 01, 2005 at 10:09:30PM +0200, Steinar H. Gunderson wrote: address_city_index btree (city) address_county_index btree (county) address_locality_1_index btree (locality_1) address_locality_2_index btree (locality_2) address_pc_bottom_index btree (postcode_bottom) address_pc_middle_index btree (postcode_middle) address_pc_top_index btree (postcode_top) address_pc_top_middle_bottom_index btree (postcode_top, postcode_middle, postcode_bottom) address_pc_top_middle_index btree (postcode_top, postcode_middle) address_postcode_index btree (postcode) address_property_type_index btree (property_type) address_street_index btree (street) street_prefix btree (lower(substring((street)::text, 1, 1))) Wow, that's quite a lot of indexes... but your problem isn't reported as being in insert/update/delete. Hah, well now that you mention it. Basically, 100,000 rows arrive in a bulk import every month, and the only way I can get it to complete in any sane time frame at all is to drop the indexes, do the import and then recreate the indexes. But that's something that I'm OK with - FTR, this "drop the indexes, do foo, recreate the indexes" approach is Industry Standard Practice for bulk inserts/updates/deletes, regardless of DB product used. - the imports don't have to be that fast and whilst important, it's not *the* critical path. Selection from the database is, hence the indexes. A DB _without_ indexes that fits into RAM during ordinary operation may actually be faster than a DB _with_ indexes that does not. Fitting the entire DB into RAM during ordinary operation, if at all possible, should be the first priority with a small data-mine-like application such as you've described. Also, normalization is _not_ always a good thing for data mining like apps.
Having most or everything you need in one place in a compact and regular format is usually more effective for data mines than Nth Order Normal Form optimization to the degree usually found in textbooks using OLTP-like examples. Indexes are a complication used as a performance enhancing technique because without them the DB is not performing well enough. IME, it's usually better to get as much performance as one can from other aspects of design and _then_ start adding complications. Including indexes. Even if you fit the whole DB in RAM, you are very likely to need some indexes; but profile your performance first and then add indexes as needed rather than just adding them willy nilly early in the design process. You said you had 1GB of RAM on the machine now. That clearly is inadequate to your desired performance given what you said about the DB. Crank that box to 4GB and tighten up your data structures. Then see where you are. This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a SATA harddrive. 8.0 or 8.1 might help you some -- better (and more!) disks will probably help a _lot_. Ok, I did try 8.0 when I started this and found that the server bind parameters (both via DBD::Pg (with pg_prepare_server = 1) and via JDBC (various versions I tried)) failed - the parameters were clearly not being substituted. This was Postgresql 8.0 from Debian unstable. That was a couple of weeks ago and I've not been back to check whether its been fixed. Anyway, because of these problems I dropped back to 7.4. Since I assume you are not going to run anything with the string unstable in its name in production (?!), why not try a decent production ready distro like SUSE 9.x and see how pg 8.0.3 runs on a OS more representative of what you are likely (or at least what is safe...) to run in production? 
Queries such as: select locality_2 from address where locality_2 = 'Manchester'; are taking 14 seconds to complete, and this is only 2 years worth of data - we will have up to 15 years (so over 15 million rows). As Tom pointed out; you're effectively doing random searches here, and using CLUSTER might help. Normalizing your data to get smaller rows (and avoid possibly costly string comparisons if your strcoll() is slow) will probably also help. Ok, so you're saying that joining the address table into an address_city table (the obvious normalization) will help here? The locale settings in postgresql.conf all have en_GB and a \l shows encoding of LATIN1. So I don't think I've set anything to UTF8 or such like. I need to get to the stage where I can run queries such as: select street, locality_1, locality_2, city from address where (city = 'Nottingham' or locality_2 = 'Nottingham' or locality_1 = 'Nottingham') and upper(substring(street from 1 for 1)) = 'A' group by street, locality_1, locality_2, city order by street limit 20 offset 0 This might be a lot quicker than pulling all the records like in your example
Re: [PERFORM] Massive performance issues
On Thu, Sep 01, 2005 at 11:52:45PM +0200, Steinar H. Gunderson wrote: On Thu, Sep 01, 2005 at 10:13:59PM +0100, Matthew Sackman wrote: Well that's the thing - on the queries where it decides to use the index it only reads at around 3MB/s and the CPU is maxed out, whereas when it doesn't use the index, the disk is being read at 60MB/s. So when it decides to use an index, I don't seem to be IO bound at all. Or at least that's the way it seems to me. You are I/O bound; your disk is doing lots and lots of seeks. The SATA interface is not the bottleneck; the disk's ability to rotate and move its heads is. Ahh of course (/me hits head against wall). Because I've /seen/ it read at 60MB/s I was assuming that if it wasn't reading that fast then I'm not IO bound but of course, it's not reading sequentially. That all makes sense. Been a long day etc... ;-) Matthew ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [PERFORM] Massive performance issues
On Thu, Sep 01, 2005 at 06:05:43PM -0400, Ron wrote: Selection from the database is, hence the indexes. A DB _without_ indexes that fits into RAM during ordinary operation may actually be faster than a DB _with_ indexes that does not. Fitting the entire DB into RAM during ordinary operation if at all possible should be the first priority with a small data mine-like application such as you've described. That makes sense. Also normalization is _not_ always a good thing for data mining like apps. Having most or everything you need in one place in a compact and regular format is usually more effective for data mines than Nth Order Normal Form optimization to the degree usually found in textbooks using OLTP-like examples. Sure. Ok, I did try 8.0 when I started this and found that the server bind parameters (both via DBD::Pg (with pg_prepare_server = 1) and via JDBC (various versions I tried)) failed - the parameters were clearly not being substituted. This was Postgresql 8.0 from Debian unstable. That was a couple of weeks ago and I've not been back to check whether its been fixed. Anyway, because of these problems I dropped back to 7.4. Since I assume you are not going to run anything with the string unstable in its name in production (?!), why not try a decent production ready distro like SUSE 9.x and see how pg 8.0.3 runs on a OS more representative of what you are likely (or at least what is safe...) to run in production? Well, you see, as ever, it's a bit complicated. The company I'm doing the development for has been subcontracted to do it and the contractor was contracted by the actual client. So there are two companies involved in addition to the client. Sadly, the client actually has dictated things like it will be deployed on FreeBSD and thou shall not argue. At this point in time, I actually have very little information about the specification of the boxen that'll be running this application. This is something I'm hoping to solve very soon. 
The worst part of it is that I'm not going have direct (ssh) access to the box and all configuration changes will most likely have to be relayed through techies at the client so fine tuning this is going to be a veritable nightmare. I need to get to the stage where I can run queries such as: select street, locality_1, locality_2, city from address where (city = 'Nottingham' or locality_2 = 'Nottingham' or locality_1 = 'Nottingham') and upper(substring(street from 1 for 1)) = 'A' group by street, locality_1, locality_2, city order by street limit 20 offset 0 This might be a lot quicker than pulling all the records like in your example queries... Yes, that certainly does seem to be the case - around 4 seconds. But I need it to be 10 times faster (or thereabouts) otherwise I have big problems! *beats drum* Get it in RAM, Get it in RAM, ... Ok, but I currently have 2 million rows. When this launches in a couple of weeks, it'll launch with 5 million+ and then gain a million a year. I think the upshot of this all is 4GB RAM as a minimum and judicious use of normalization so as to avoid more expensive string comparisons and reduce table size is my immediate plan (along with proper configuration of pg). Matthew ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Massive performance issues
On Thu, Sep 01, 2005 at 06:42:31PM +0100, Matthew Sackman wrote: address_pc_top_index btree (postcode_top) address_pc_top_middle_bottom_index btree (postcode_top, postcode_middle, postcode_bottom) address_pc_top_middle_index btree (postcode_top, postcode_middle) This doesn't address the query performance problem, but isn't only one of these indexes necessary? The second one, on all three columns, because searches involving only postcode_top or only postcode_top and postcode_middle could use it, making the indexes on only those columns superfluous. Or am I missing something? -- Michael Fuhr ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
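Michael's point follows from how btree composite indexes work: an index on (a, b, c) can also serve lookups on its leading columns (a) and (a, b). A sketch of the cleanup he suggests - verify with EXPLAIN on the real queries before dropping anything in production:

```sql
-- The three-column index can serve lookups on postcode_top alone
-- and on (postcode_top, postcode_middle), making these redundant:
DROP INDEX address_pc_top_index;
DROP INDEX address_pc_top_middle_index;

-- Keep the covering composite index:
--   address_pc_top_middle_bottom_index
--     btree (postcode_top, postcode_middle, postcode_bottom)
```

Two fewer indexes also shrinks the monthly bulk-import cost, since there is less to rebuild.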
Re: [PERFORM] Poor performance on HP Package Cluster
Do you have any sources for that information? I am running dual SmartArray 6402's in my DL585 and haven't noticed anything poor about their performance. On Sep 1, 2005, at 2:24 PM, Luke Lonergan wrote: Are you using the built-in HP SmartArray RAID/SCSI controllers? If so, that could be your problem, they are known to have terrible and variable performance with Linux. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [PERFORM] Massive performance issues
At 06:22 PM 9/1/2005, Matthew Sackman wrote: On Thu, Sep 01, 2005 at 06:05:43PM -0400, Ron wrote: Since I assume you are not going to run anything with the string unstable in its name in production (?!), why not try a decent production ready distro like SUSE 9.x and see how pg 8.0.3 runs on a OS more representative of what you are likely (or at least what is safe...) to run in production? Well, you see, as ever, it's a bit complicated. The company I'm doing the development for has been subcontracted to do it and the contractor was contracted by the actual client. So there are two companies involved in addition to the client. Sadly, the client actually has dictated things like it will be deployed on FreeBSD and thou shall not argue. At least get them to promise they will use a release the BSD folks mark stable! At this point in time, I actually have very little information about the specification of the boxen that'll be running this application. This is something I'm hoping to solve very soon. The worst part of it is that I'm not going have direct (ssh) access to the box and all configuration changes will most likely have to be relayed through techies at the client so fine tuning this is going to be a veritable nightmare. IME, what you have actually just said is It will not be possible to safely fine tune the DB unless or until I have direct access; and/or someone who does have direct access is correctly trained. Ick. I need to get to the stage where I can run queries such as: select street, locality_1, locality_2, city from address where (city = 'Nottingham' or locality_2 = 'Nottingham' or locality_1 = 'Nottingham') and upper(substring(street from 1 for 1)) = 'A' group by street, locality_1, locality_2, city order by street limit 20 offset 0 This might be a lot quicker than pulling all the records like in your example queries... Yes, that certainly does seem to be the case - around 4 seconds. 
But I need it to be 10 times faster (or thereabouts) otherwise I have big problems! *beats drum* Get it in RAM, Get it in RAM, ... Ok, but I currently have 2 million rows. When this launches in a couple of weeks, it'll launch with 5 million+ and then gain a million a year. At my previously mentioned optimum of 85B per row, 2M rows is 170MB. 5M rows is 425MB. Assuming the gain of 1M rows per year, that's +85MB per year for this table. Up to 2GB DIMMs are currently standard, and 4GB DIMMs are just in the process of being introduced. Mainboards with anything from 4 to 16 DIMM slots are widely available. IOW, given the description you've provided this DB should _always_ fit in RAM. Size the production system such that the entire DB fits into RAM during ordinary operation with an extra 1GB of RAM initially tossed on as a safety measure and the client will be upgrading the HW because it's obsolete before they run out of room in RAM. I think the upshot of this all is 4GB RAM as a minimum and judicious use of normalization so as to avoid more expensive string comparisons and reduce table size is my immediate plan (along with proper configuration of pg). My suggestion is only slightly different. Reduce table size(s) and up the RAM to the point where the whole DB fits comfortably in RAM. You've got the rare opportunity to build a practical Memory Resident Database. It should run like a banshee when you're done. I'd love to see the benches on the final product. Ron Peacetree ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
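Ron's sizing arithmetic, checked in Python (the 85B/row optimum is his earlier assumption; MB here is decimal, matching the figures in the post):

```python
ROW = 85          # bytes per row at Ron's target size
MB = 10**6        # decimal MB, as used in the post

today = 2_000_000 * ROW / MB       # current 2M rows -> 170MB
launch = 5_000_000 * ROW / MB      # 5M+ rows at launch -> 425MB
growth = 1_000_000 * ROW / MB      # +1M rows/year -> +85MB/year

print(today, launch, growth)
```

Even a decade out (~425MB + 10*85MB ≈ 1.3GB) the table stays far under the 4GB of RAM being proposed, which is the basis of Ron's "it should always fit in RAM" claim.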
Re: [PERFORM] Massive performance issues
On Thu, Sep 01, 2005 at 06:42:31PM +0100, Matthew Sackman wrote:

 flat_extra | character varying(100) | not null
 number     | character varying(100) | not null
 street     | character varying(100) | not null
 locality_1 | character varying(100) | not null
 locality_2 | character varying(100) | not null
 city       | character varying(100) | not null
 county     | character varying(100) | not null

Having these fixed probably won't give you any noticeable improvements; unless there's something natural about your data setting 100 as a hard limit, you could just as well drop these.

 address_city_index btree (city)
 address_county_index btree (county)
 address_locality_1_index btree (locality_1)
 address_locality_2_index btree (locality_2)
 address_pc_bottom_index btree (postcode_bottom)
 address_pc_middle_index btree (postcode_middle)
 address_pc_top_index btree (postcode_top)
 address_pc_top_middle_bottom_index btree (postcode_top, postcode_middle, postcode_bottom)
 address_pc_top_middle_index btree (postcode_top, postcode_middle)
 address_postcode_index btree (postcode)
 address_property_type_index btree (property_type)
 address_street_index btree (street)
 street_prefix btree (lower(substring((street)::text, 1, 1)))

Wow, that's quite a lot of indexes... but your problem isn't reported as being in insert/update/delete.

This is with postgresql 7.4 running on linux 2.6.11 with a 3GHz P4 and a SATA harddrive.

8.0 or 8.1 might help you some -- better (and more!) disks will probably help a _lot_.

Queries such as: select locality_2 from address where locality_2 = 'Manchester'; are taking 14 seconds to complete, and this is only 2 years worth of data - we will have up to 15 years (so over 15 million rows).

As Tom pointed out, you're effectively doing random searches here, and using CLUSTER might help. Normalizing your data to get smaller rows (and avoid possibly costly string comparisons if your strcoll() is slow) will probably also help.
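As a sketch of those two suggestions (the CLUSTER syntax shown is the pre-8.3 form current in the 7.4/8.0 era; the city lookup table and its names are hypothetical):

```sql
-- Physically reorder the table by the most-queried index so equal
-- locality_2 values land on adjacent pages. CLUSTER is a one-shot
-- operation and must be re-run after large bulk loads.
CLUSTER address_locality_2_index ON address;
ANALYZE address;

-- Normalization sketch: replace the repeated varchar(100) city column
-- with a small integer foreign key into a lookup table.
CREATE TABLE city (
    city_id serial PRIMARY KEY,
    name    varchar(100) NOT NULL UNIQUE
);
-- address would then carry "city_id integer REFERENCES city" instead of
-- "city character varying(100)", shrinking each row and turning string
-- comparisons into integer comparisons.
```

The same treatment could apply to county and the locality columns, which repeat a small set of values across millions of rows.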
I need to get to the stage where I can run queries such as:

select street, locality_1, locality_2, city from address
where (city = 'Nottingham' or locality_2 = 'Nottingham'
       or locality_1 = 'Nottingham')
  and upper(substring(street from 1 for 1)) = 'A'
group by street, locality_1, locality_2, city
order by street
limit 20 offset 0

This might be a lot quicker than pulling all the records like in your example queries...

/* Steinar */
-- Homepage: http://www.sesse.net/
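One thing worth checking in passing: the street_prefix index quoted earlier is on lower(substring(street, 1, 1)), while this query filters on upper(...), so the planner cannot use that index for it. A matching expression index would look like this (a sketch; the index name is hypothetical):

```sql
-- Hypothetical index matching the upper(...) predicate in the query;
-- the existing street_prefix index is on lower(...) and so won't match.
CREATE INDEX street_prefix_upper ON address (upper(substring(street, 1, 1)));
```

Alternatively, rewriting the query to use lower(...) and comparing against 'a' would let it use the index that already exists.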
[PERFORM] Avoid using swap in a cluster
Hi all. In a cluster, is there any way to use the main memory of the other nodes instead of swap? If I have a query with many sub-queries and a lot of data, I can easily fill all the memory in one node. The point is: is there any way for that same query to keep using main memory from the other nodes rather than spilling to swap? Thank you, Ricardo.
Re: [PERFORM] Massive performance issues
It would be good to see EXPLAIN ANALYZE output for the three queries below (the real vs. estimated row counts being of interest). The number of pages in your address table might be interesting to know too.

regards

Mark

Matthew Sackman wrote (with a fair bit of snippage):

explain select locality_2 from address where locality_2 = 'Manchester';

gives

                        QUERY PLAN
 Seq Scan on address  (cost=0.00..80677.16 rows=27923 width=12)
   Filter: ((locality_2)::text = 'Manchester'::text)

explain select locality_1 from address where locality_1 = 'Manchester';

gives

                        QUERY PLAN
 Index Scan using address_locality_1_index on address
   (cost=0.00..69882.18 rows=17708 width=13)
   Index Cond: ((locality_1)::text = 'Manchester'::text)

select street, locality_1, locality_2, city from address
where (city = 'Nottingham' or locality_2 = 'Nottingham'
       or locality_1 = 'Nottingham')
  and upper(substring(street from 1 for 1)) = 'A'
group by street, locality_1, locality_2, city
order by street
limit 20 offset 0
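For reference, both figures Mark asks about can be obtained like this (unlike plain EXPLAIN, EXPLAIN ANALYZE actually executes the query, so it reports real row counts and per-node timings alongside the planner's estimates):

```sql
-- Real vs. estimated row counts and per-node timings:
EXPLAIN ANALYZE
SELECT locality_2 FROM address WHERE locality_2 = 'Manchester';

-- Page and tuple counts for the table, as of the last VACUUM/ANALYZE:
SELECT relpages, reltuples FROM pg_class WHERE relname = 'address';
```

A large gap between the estimated and actual row counts usually points at stale or insufficient statistics (ANALYZE, or a higher statistics target, may help).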
Re: [PERFORM] Massive performance issues
Matthew Sackman wrote:

I need to get to the stage where I can run queries such as:

select street, locality_1, locality_2, city from address
where (city = 'Nottingham' or locality_2 = 'Nottingham'
       or locality_1 = 'Nottingham')
  and upper(substring(street from 1 for 1)) = 'A'
group by street, locality_1, locality_2, city
order by street
limit 20 offset 0

and have the results very quickly.

This sort of query will be handled nicely in 8.1 - it has bitmap and/or processing to make use of multiple indexes. Note that 8.1 is in beta now.

Cheers

Mark
Re: [PERFORM] Poor performance on HP Package Cluster
Dan,

On 9/1/05 4:02 PM, Dan Harris [EMAIL PROTECTED] wrote: Do you have any sources for that information? I am running dual SmartArray 6402's in my DL585 and haven't noticed anything poor about their performance.

I've previously posted comprehensive results using the 5i and 6xxx-series smart arrays with software RAID and HW RAID on 3 different kernels, alongside LSI and Adaptec SCSI controllers and an Adaptec 24xx HW RAID adapter - results with bonnie++ and simple sequential read/write with dd. I'll post them again here for reference in the next message.

Yes, the performance of the SmartArray controllers under Linux was abysmal, even when run by the labs at HP.

- Luke